2006 values from NIST. For more physical constants, see http://physics.nist.gov/cuu/Constants/ .

Speed of light in vacuum             c = 299 792 458 m s⁻¹ (exact)
Boltzmann constant                   k = 1.380 6504(24) × 10⁻²³ J K⁻¹
Stefan-Boltzmann constant            σ = 5.670 400(40) × 10⁻⁸ W m⁻² K⁻⁴
  Relative standard uncertainty      ±7.0 × 10⁻⁶
Avogadro constant                    N_A, L = 6.022 141 79(30) × 10²³ mol⁻¹
  Relative standard uncertainty      ±5.0 × 10⁻⁸
Molar gas constant                   R = 8.314 472(15) J mol⁻¹ K⁻¹
Electron mass                        m_e = 9.109 382 15(45) × 10⁻³¹ kg
Proton mass                          m_p = 1.672 621 637(83) × 10⁻²⁷ kg
Proton/electron mass ratio           m_p/m_e = 1836.152 672 47(80)
Elementary charge                    e = 1.602 176 487(40) × 10⁻¹⁹ C
Electron g-factor                    g_e = −2.002 319 304 3622(15)
Proton g-factor                      g_p = 5.585 694 713(46)
Neutron g-factor                     g_N = −3.826 085 45(90)
Muon mass                            m_μ = 1.883 531 30(11) × 10⁻²⁸ kg
Inverse fine structure constant      α⁻¹ = 137.035 999 679(94)
Planck constant                      h = 6.626 068 96(33) × 10⁻³⁴ J s
Planck constant over 2π              ħ = 1.054 571 628(53) × 10⁻³⁴ J s
Bohr radius                          a₀ = 0.529 177 208 59(36) × 10⁻¹⁰ m
Bohr magneton                        μ_B = 927.400 915(23) × 10⁻²⁶ J T⁻¹


Reviews

“... most excellent tensor paper.... I feel I have come to a deep and abiding understanding of relativistic
tensors.... The best explanation of tensors seen anywhere!” -- physics graduate student
From OAD:   sin = opp / hyp
            cos = adj / hyp
            sin² + cos² = 1
From OAB:   tan = opp / adj
            tan² + 1 = sec²
            (and with OAD) tan = sin / cos
            sec = hyp / adj = 1 / cos
From OAC:   cot = adj / opp
            cot² + 1 = csc²
            (and with OAD) cot = cos / sin
            csc = hyp / opp = 1 / sin



Copyright 2001 Inductive Logic. All rights reserved. 
1 Introduction


Mathematical Physics, or Physical Mathematics? 


Is There Another Kind of Physics? Mathematical Physics is devoted to the natural emergence of 
mathematics from our curiosity about the universe around us. All physics is mathematical, but Mathematical 
Physics illustrates that math is not abstract, or capricious, but an inescapable part of the natural world. 
Despite its humble beginnings rooted in conceptual understanding and the practice of science, many find that 
Mathematical Physics holds a beauty and fascination all its own. 


As with all “Funky” notes, we emphasize the physical meaning of the underlying concepts. For example, 
we stress a coordinate-free, geometric approach to vector operations. 


Why Physicists and Mathematicians Argue 


Physics goals and mathematics goals are antithetical. Physics seeks to ascribe meaning to the mathematics
that describes the world, to “understand” it physically. Mathematics seeks to strip the equations of all physical
meaning, and view them in purely abstract terms. These divergent goals set up a natural conflict between the 
two camps. Each goal has its merits: the value of physics is (or should be) self-evident; the value of 
mathematical abstraction, separate from any single application, is generality: the results can be used on a 
wide range of applications. 


Why Funky? 


The purpose of the “Funky” series of documents is to help develop an accurate physical, conceptual, 
geometric, and pictorial understanding of important physics topics. We focus on areas that don’t seem to be 
covered well in most texts. The Funky series attempts to clarify those neglected concepts, and others that 
seem likely to be challenging and unexpected (funky?). The Funky documents are intended for serious 
students of physics; they are not “popularizations” or oversimplifications. 


Physics includes math, and we’re not shy about it, but we also don’t hide behind it. 


Without a conceptual understanding, math is gibberish. 


This work is one of several aimed at graduate and advanced-undergraduate physics students. Go to our 
web page (in the page header) for the latest versions of the Funky Series, and for contact information. We’re 
looking for feedback, so please let us know what you think. 


How to Use This Document 


This work is not a text book. 


There are plenty of those, and they cover most of the topics quite well. This work is meant to be used 
with a standard text, to help emphasize those things that are most confusing for new students. When standard 
presentations don’t make sense, come here. 


You should read all of this introduction to familiarize yourself with the notation and contents. After that, 
this work is meant to be read in the order that most suits you. Each section stands largely alone, though the 
sections are ordered logically. Simpler material generally appears before more advanced topics. You may 
read it from beginning to end, or skip around to whatever topic is most interesting. The “Shorts” chapter is 
a diverse set of very short topics, meant for quick reading. 


If you don’t understand something, read it again once, then keep reading. 


Don’t get stuck on one thing. Often, the following discussion will clarify things. 


The index is not yet developed, so go to the web page on the front cover, and text-search in this document. 
Thank You


I owe a big thank you to many professors at both SDSU and UCSD, for their generosity even when I
wasn’t a real student: Dr. Herbert Shore, Dr. Peter Salamon, Dr. Arlette Baljon, Dr. Andrew Cooksy, Dr.
George Fuller, Dr. Tom O’Neil, Dr. Terry Hwa, and others.


Scope 


What This Text Covers 


This text covers some of the unusual or challenging concepts in graduate mathematical physics. It is
also suitable for upper-division undergraduates. We expect that you are taking or have
taken such a course, and have a good text book. Funky Mathematical Physics Concepts supplements those
other sources.


What This Text Doesn’t Cover 


This text is not a mathematical physics course in itself, nor a review of such a course. We do not cover 
all basic mathematical concepts; only those that are very important, unusual, or especially challenging 
(funky?). 


What You Already Know 


This text assumes you understand basic integral and differential calculus, and partial differential 
equations. Further, it assumes you have a mathematical physics text for the bulk of your studies, and are 
using Funky Mathematical Physics Concepts to supplement it. 


Notation 

Sometimes the variables are inadvertently not written in italics, but I hope the meanings are clear. 
?? refers to places that need more work.
TBS To be supplied (one hopes) in the future. 


Interesting points that you may skip are “asides,” shown in smaller font and narrowed margins. Notes to myself 
may also be included as asides. 




Formulas: We write the integral over the entire domain as a subscript “∞”, for any number of
dimensions:

$$\text{1D: } \int_\infty dx \qquad\qquad \text{3D: } \int_\infty d^3x$$

Evaluation between limits: we use the notation $[\text{function}]_a^b$ to denote the evaluation of the function
between a and b, i.e.,

$$[f(x)]_a^b \equiv f(b) - f(a). \qquad \text{For example, } \int_0^1 3x^2\,dx = [x^3]_0^1 = 1^3 - 0^3 = 1.$$

We write the probability of an event as “Pr(event).” 


Column vectors: Since it takes a lot of room to write column vectors, but it is often important to 
distinguish between column and row vectors, I sometimes save vertical space by using the fact that a column 
vector is the transpose of a row vector: 


$$\begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix} = (a, b, c, d)^T$$
Random variables: We use a capital letter, e.g. X, to represent the population from which instances of
a random variable, x (lower case), are observed. In a sense, X is a representation of the PDF of the random
variable, pdf_X(x).


We denote that a random variable X comes from a population PDF as: X ∈ pdf_Y, e.g.: X ∈ χ²(n). To
denote that X is a constant times a random variable from pdf_Y, we write: X ∈ k pdf_Y, e.g. X ∈ k χ²(n).


For Greek letters, pronunciations, and use, see Quirky Quantum Concepts. Other math symbols: 


Symbol    Definition

iff       if and only if
∝         proportional to. E.g., a ∝ b means “a is proportional to b”
⊥         perpendicular to
∴         therefore
~         of the order of (sometimes used imprecisely as “approximately equals”)
≡         is defined as; identically equal to (i.e., equal in all cases)
⊗         tensor product, aka outer product
⊕         direct sum

In mostly older texts, German type (font: Fraktur) is used to provide still more variable names: 


Latin    German capital    German lowercase    Notes

A        𝔄                 𝔞                   Distinguish capital from U, V
C        ℭ                 𝔠                   Distinguish capital from E, G
D        𝔇                 𝔡                   Distinguish capital from O, Q
E        𝔈                 𝔢                   Distinguish capital from C, G
G        𝔊                 𝔤                   Distinguish capital from C, E
I        ℑ                 𝔦                   Capital almost identical to J
J        𝔍                 𝔧                   Capital almost identical to I
M        𝔐                 𝔪                   Distinguish capital from W
O        𝔒                 𝔬                   Distinguish capital from D, Q
V        𝔙                 𝔳                   Distinguish capital from A, U
W        𝔚                 𝔴                   Distinguish capital from M
X        𝔛                 𝔵                   Distinguish lowercase from r
2 Random Short Topics


I Always Lie


Logic, and logical deduction, are essential elements of all science. Too many of us acquire our logical 
reasoning abilities only through osmosis, without any concrete foundation. Unfortunately, two of the most 
commonly given examples of logical reasoning are both wrong. I found one in a book about Kurt Gödel (!),
the famous logician. 


Fallacy #1: Consider the statement, “I always lie.” Wrong claim: this is a contradiction, and cannot be 
either true or false. Right answer: this is simply false. The negation of “I always lie” is not “I always tell the 
truth;” it is “I don’t always lie,” equivalent to “I at least sometimes tell the truth.” Since “I always lie” cannot 
be true, it must be false, and it must be one of my (exceedingly rare) lies. 


Fallacy #2: Consider the statement, “The barber shaves everyone who doesn’t shave himself. Who 
shaves the barber?” Wrong answer: it’s a contradiction, and has no solution. Right answer: the barber shaves 
himself. The original statement is about people who don’t shave themselves; it says nothing about people 
who do shave themselves. If A then B; but if not A, then we know nothing about B. The barber does shave 
everyone who does not shave himself, and he also shaves one person who does shave himself: himself. To 
be a contradiction, the claim would need to be something like, “The barber shaves all and only those who 
don’t shave themselves.” 
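
The difference between the two barber claims can be checked by brute force. The sketch below is only an illustration: the three-person town, the table name shaves, and the helper names are assumptions made for the example, not part of the argument above. It enumerates every possible "who shaves whom" table and keeps those consistent with each claim.

```python
from itertools import product

def consistent_worlds(n, rule):
    """Yield every 'who shaves whom' table for n people that satisfies rule.

    shaves[i][j] is True when person i shaves person j; person 0 is the barber.
    """
    for bits in product([False, True], repeat=n * n):
        shaves = [list(bits[i * n:(i + 1) * n]) for i in range(n)]
        if all(rule(shaves, j) for j in range(n)):
            yield shaves

def weak_rule(shaves, j):
    # "The barber shaves everyone who doesn't shave himself":
    # if j does not shave himself, then the barber shaves j.
    return shaves[0][j] or shaves[j][j]

def strong_rule(shaves, j):
    # "The barber shaves all and only those who don't shave themselves."
    return shaves[0][j] == (not shaves[j][j])

n = 3  # hypothetical town size, chosen only to keep the search small
weak = list(consistent_worlds(n, weak_rule))
strong = list(consistent_worlds(n, strong_rule))

print(len(weak) > 0)                       # is the weak claim satisfiable?
print(all(world[0][0] for world in weak))  # does the barber shave himself in every such world?
print(len(strong))                         # number of worlds consistent with the strong claim
```

Under these assumptions, every table that satisfies the first claim has the barber shaving himself, while no table satisfies the "all and only" version.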


Logic matters. 


What’s Hyperbolic About Hyperbolic Sine? 


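[Figure: (Left) unit circle, with the sector swept by the angle a shaded; area = a/2. (Right) unit hyperbola, with the region bounded by the x-axis, the ray from the origin, and the hyperbola shaded; area = a/2.]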


From where do the hyperbolic trigonometric functions get their names? By analogy with the circular 
functions. We usually think of the argument of circular functions as an angle, a. But in a unit circle, the area 
covered by the angle a is a/ 2 (above left): 


$$\text{area} = \frac{a}{2\pi}\,\pi r^2 = \frac{a}{2} \qquad (r = 1).$$


Instead of the unit circle, x² + y² = 1, we can consider the area bounded by the x-axis, the ray from the origin,
and the unit hyperbola, x² − y² = 1 (above right). Then the x and y coordinates on the curve are called the
hyperbolic cosine and hyperbolic sine, respectively. Notice that the hyperbola equation implies the well-known
hyperbolic identity:


$$x \equiv \cosh a, \quad y \equiv \sinh a, \quad x^2 - y^2 = 1 \;\Rightarrow\; \cosh^2 a - \sinh^2 a = 1.$$


Proving that the area bounded by the x-axis, ray, and hyperbola satisfies the standard definition of the 
hyperbolic functions requires evaluating an elementary, but tedious, integral: (?? is the following right?) 
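$$\text{area} = \frac{a}{2} = \frac{1}{2}xy - \int_1^x y\,dx \qquad \text{Use: } y = \sqrt{x^2-1}$$

$$a = x\sqrt{x^2-1} - 2\int_1^x \sqrt{x^2-1}\,dx$$

For the integral, let $x = \sec\theta$, $dx = \tan\theta\sec\theta\,d\theta$ ⇒ $y = \sqrt{\sec^2\theta - 1} = \tan\theta$:

$$\int_1^x \sqrt{x^2-1}\,dx = \int_0^{\sec^{-1}x} \sqrt{\sec^2\theta-1}\,\tan\theta\sec\theta\,d\theta = \int_0^{\sec^{-1}x} \tan^2\theta\sec\theta\,d\theta = \int_0^{\sec^{-1}x} \frac{\sin^2\theta}{\cos^3\theta}\,d\theta$$


We try integrating by parts (but fail):


$$U = \tan\theta, \quad dV = \sec\theta\tan\theta\,d\theta \;\Rightarrow\; dU = \sec^2\theta\,d\theta, \quad V = \sec\theta$$

$$\int_0^{\sec^{-1}x} \tan^2\theta\sec\theta\,d\theta = UV - \int V\,dU = \Big[\sec\theta\tan\theta\Big]_0^{\sec^{-1}x} - \int_0^{\sec^{-1}x} \sec^3\theta\,d\theta$$


This is too hard, so we try reverting to fundamental functions sin( ) and cos( ):


$$U = \sin\theta, \quad dV = \cos^{-3}\theta\,\sin\theta\,d\theta \;\Rightarrow\; dU = \cos\theta\,d\theta, \quad V = \tfrac{1}{2}\cos^{-2}\theta$$

$$2\int_0^{\sec^{-1}x} \frac{\sin^2\theta}{\cos^3\theta}\,d\theta = 2UV - 2\int V\,dU = \left[\frac{\sin\theta}{\cos^2\theta}\right]_0^{\sec^{-1}x} - \int_0^{\sec^{-1}x} \cos^{-2}\theta\,\cos\theta\,d\theta \qquad \text{Use: } \frac{\sin\theta}{\cos^2\theta} = \sec\theta\tan\theta = xy$$

$$= xy - \int_0^{\sec^{-1}x} \sec\theta\,d\theta = xy - \Big[\ln\left|\sec\theta + \tan\theta\right|\Big]_0^{\sec^{-1}x} = xy - \ln\left|x + \sqrt{x^2-1}\right|$$

$$a = xy - xy + \ln\left|x + \sqrt{x^2-1}\right| = \ln\left|x + \sqrt{x^2-1}\right| \;\Rightarrow\; e^a = x + \sqrt{x^2-1}$$


Solve for x in terms of a, by squaring both sides:


$$e^{2a} = x^2 + 2x\sqrt{x^2-1} + x^2 - 1 = 2x\left(x + \sqrt{x^2-1}\right) - 1 = 2xe^a - 1$$

$$e^{2a} + 1 = 2xe^a \;\Rightarrow\; e^a + e^{-a} = 2x \;\Rightarrow\; x \equiv \cosh a = \frac{e^a + e^{-a}}{2}$$


The definition for sinh follows immediately from:


$$\cosh^2 a - \sinh^2 a = x^2 - y^2 = 1 \;\Rightarrow\; y = \sqrt{x^2-1}$$

$$\sinh a \equiv y = \sqrt{\frac{e^{2a} + 2 + e^{-2a}}{4} - 1} = \sqrt{\frac{e^{2a} - 2 + e^{-2a}}{4}} = \frac{e^a - e^{-a}}{2}$$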
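
The area claim can also be spot-checked numerically. The sketch below is a sanity check rather than a proof; the function name sector_area, the midpoint rule, and the step count are choices made for this example. It subtracts the area under the hyperbola from the triangle under the ray, and compares the result with a/2.

```python
import math

def sector_area(a, steps=20000):
    """Area bounded by the x-axis, the ray to (cosh a, sinh a), and x^2 - y^2 = 1."""
    x_end, y_end = math.cosh(a), math.sinh(a)
    triangle = 0.5 * x_end * y_end  # area under the ray from the origin
    # Midpoint rule for the area under the hyperbola, from x = 1 to x = x_end.
    dx = (x_end - 1.0) / steps
    under_curve = sum(
        math.sqrt((1.0 + (i + 0.5) * dx) ** 2 - 1.0) for i in range(steps)
    ) * dx
    return triangle - under_curve

for a in (0.5, 1.0, 2.0):
    print(a, sector_area(a), a / 2)  # the last two columns should nearly agree
```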
Basic Calculus You May Not Know


Amazingly, many calculus courses never provide a precise definition of a “limit,” despite the fact that 
both of the fundamental concepts of calculus, derivatives and integrals, are defined as limits! So here we go: 


Basic calculus relies on 4 major concepts: 
1. Functions 

2. Limits 

3. Derivatives 

4. Integrals 


1. Functions: Briefly, (in real analysis) a function takes one or more real values as inputs, and produces
one or more real values as outputs. The inputs to a function are called the arguments. The simplest case is
a real-valued function of a real-valued argument, e.g., f(x) = sin x. Mathematicians would write (f: R¹ → R¹),
read “f is a map (or function) from the real numbers to the real numbers.” A function which produces
more than one output may be considered a vector-valued function.


2. Limits: Definition of “limit” (for a real-valued function of a single argument, f: R¹ → R¹):


L is the limit of f(x) as x approaches a, iff for every ε > 0, there exists a δ (> 0) such that |f(x) − L| < ε whenever
0 < |x − a| < δ. In symbols:


$$L = \lim_{x\to a} f(x) \quad\text{iff}\quad \forall\,\varepsilon > 0,\ \exists\,\delta \text{ such that } |f(x) - L| < \varepsilon \text{ whenever } 0 < |x - a| < \delta.$$

This says that the value of the function at a doesn’t matter; in fact, most often the function is not defined at
a. However, the behavior of the function near a is important. If you can make the function arbitrarily close
to some number, L, by restricting the function’s argument to a small neighborhood around a, then L is the
limit of f as x approaches a.


Surprisingly, this definition also applies to complex functions of complex variables, where the absolute 
value is the usual complex magnitude. 


Example: Show that $\lim_{x\to 1} \dfrac{2x^2-2}{x-1} = 4$.


Solution: We prove the existence of δ given any ε by computing the necessary δ from ε. Note that for
x ≠ 1, $\dfrac{2x^2-2}{x-1} = 2(x+1)$. The definition of a limit requires that

$$\left|\frac{2x^2-2}{x-1} - 4\right| < \varepsilon \quad\text{whenever}\quad 0 < |x-1| < \delta.$$


We solve for x in terms of ε, which will then define δ in terms of ε. Since we don’t care what the function is
at x = 1, we can use the simplified form, 2(x + 1). When x = 1, this is 4, so we suspect the limit = 4. Proof:


$$|2(x+1) - 4| < \varepsilon \;\Rightarrow\; 2\,|(x+1) - 2| < \varepsilon \;\Rightarrow\; |x-1| < \frac{\varepsilon}{2} \quad\text{or}\quad 1 - \frac{\varepsilon}{2} < x < 1 + \frac{\varepsilon}{2}.$$

So by setting δ = ε/2, we construct the required δ for any given ε. Hence, for every ε, there exists a δ satisfying
the definition of a limit.
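
As a numerical spot check of this construction (a sketch only: random sampling cannot replace the proof, and the sample counts, the seed, and the helper name delta_for are arbitrary choices for this example), we can draw points with 0 < |x − 1| < δ = ε/2 and confirm that the function stays within ε of 4:

```python
import random

def f(x):
    # The function from the example; undefined at x = 1.
    return (2 * x**2 - 2) / (x - 1)

def delta_for(epsilon):
    # The choice constructed in the proof: delta = epsilon / 2.
    return epsilon / 2

random.seed(0)  # fixed seed so the spot check is repeatable
L, a = 4.0, 1.0
worst_ratio = 0.0
for epsilon in (1.0, 1e-2, 1e-4):
    delta = delta_for(epsilon)
    for _ in range(10000):
        # Sample x with 0 < |x - a| < delta, on either side of a.
        offset = random.uniform(1e-3, 0.99) * delta * random.choice((-1, 1))
        x = a + offset
        worst_ratio = max(worst_ratio, abs(f(x) - L) / epsilon)

print(worst_ratio < 1.0)  # True means |f(x) - L| < epsilon held for every sample
```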


3. Derivatives: Only now that we have defined a limit, can we define a derivative: 


$$f'(x) \equiv \lim_{\Delta x\to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}.$$


4. Integrals: A simplified definition of an integral is an infinite sum of areas under a function divided
into equal subintervals (Figure 2.1, left):

$$\int_a^b f(x)\,dx \equiv \lim_{N\to\infty} \sum_{i=1}^{N} f\!\left(a + i\,\frac{b-a}{N}\right) \underbrace{\frac{b-a}{N}}_{\Delta x} \qquad \text{(simplified definition)}.$$
Ax 


For practical physics, this definition is fine. For mathematical preciseness, the actual definition of an integral
is the limit over any possible set of subintervals [ref??], so long as the maximum of the subinterval size goes
to zero. This maximum size is called “the norm of the subdivision,” written as ‖Δx_i‖:


$$\int_a^b f(x)\,dx \equiv \lim_{\|\Delta x_i\|\to 0} \sum_{i=1}^{N} f(x_i)\,\Delta x_i \qquad \text{(precise definition)}.$$


Figure 2.1 (Left) Simplified definition of an integral as the limit of a sum of equally spaced 
samples. (Right) Precise definition requires convergence for arbitrary, but small, subdivisions. 


Why do mathematicians require this more precise definition? It’s to avoid bizarre functions, such as:
f(x) is 1 if x is rational, and zero if irrational. This means f(x) toggles wildly between 1 and 0 an infinite
number of times over any interval. However, with the simplified definition of an integral, the following are
both well defined:


$$\int_0^{3.14} f(x)\,dx = 3.14, \quad\text{and}\quad \int_0^{\pi} f(x)\,dx = 0 \qquad \text{(with simplified definition of integral)}.$$


In contrast, with the mathematically precise definition of an integral, both integrals are undefined. (There 
are other types of integrals defined, but they are beyond our scope.) 
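
For a well-behaved function, the two definitions give the same answer. The sketch below compares them for f(x) = x² on [0, 1]; the test function, the right-endpoint sampling, and the random cut points are assumptions chosen for this illustration (the precise definition allows any subdivision and any sample point in each subinterval).

```python
import random

def equal_sum(f, a, b, n):
    """Simplified definition: n equal subintervals, sampled at right endpoints."""
    dx = (b - a) / n
    return sum(f(a + i * dx) for i in range(1, n + 1)) * dx

def arbitrary_sum(f, a, b, n, rng):
    """One instance of the precise definition: n unequal subintervals, random cut points.

    Returns the sum and the norm of the subdivision (the largest subinterval).
    """
    cuts = sorted(rng.uniform(a, b) for _ in range(n - 1))
    edges = [a] + cuts + [b]
    pairs = list(zip(edges, edges[1:]))
    norm = max(right - left for left, right in pairs)
    total = sum(f(right) * (right - left) for left, right in pairs)
    return total, norm

rng = random.Random(1)  # fixed seed, an arbitrary choice
square = lambda x: x * x
for n in (100, 10000):
    total, norm = arbitrary_sum(square, 0.0, 1.0, n, rng)
    print(n, equal_sum(square, 0.0, 1.0, n), total, norm)
```

As the norm of the subdivision shrinks, both columns of sums should approach 1/3.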


The Product Rule 


Given functions U(x) and V(x), the product rule (aka the Leibniz rule) says that for differentials,

$$d(UV) = U\,dV + V\,dU. \tag{2.1}$$

When U and V are functions of x, we have:

$$d\big[U(x)V(x)\big] = U(x)V'(x)\,dx + V(x)U'(x)\,dx.$$


This leads to integration by parts, which is mostly known as an integration tool, but it is also an important 
theoretical (analytic) tool, and the essence of Legendre transformations. 


Integration By Pictures 


We assume you are familiar with integration by parts (IBP) as a tool for performing indefinite integrals. 
We start with a brief overview, and then discuss a specific example in detail. IBP takes a non-trivial integral 
into an expression with a different integral, which may be easier to perform analytically: 


$$\int f(x)\,dx = \int U\,dV = UV - \int V\,dU \qquad \text{where } U = U(x),\ V = V(x) \tag{2.2}$$

are parametric functions of x. The above comes directly from the product rule (2.1): write U dV = d(UV) − V dU,
and integrate both sides. Inserting limits of integration makes for a simple illustration of the formula’s
meaning (Figure 2.2a), but a slightly tedious equation:

$$\int_a^b f(x)\,dx = \int_{V(a)}^{V(b)} U\,dV = \big[UV\big]_{x=a}^{b} - \int_{U(a)}^{U(b)} V\,dU = \underbrace{U(b)V(b)}_{\text{big rectangle}} - \underbrace{U(a)V(a)}_{\text{small rectangle}} - \int_{U(a)}^{U(b)} V\,dU$$

where U = U(x), V = V(x).


The figure plots U vs. V, where we’ve chosen U and V to be increasing parametric functions of x. In practice, 
the RHS of (2.2) is usually written in terms of x as: 


$$\int_a^b U(x)\,\underbrace{V'(x)\,dx}_{dV} = \big[U(x)V(x)\big]_{x=a}^{b} - \int_a^b V(x)\,\underbrace{U'(x)\,dx}_{dU}. \tag{2.3}$$


Note that x is the original integration variable (not U or V), so all the limits of integration are the original x =
a to x = b.


As a specific example, consider: 


$$\int \underbrace{x\sin x}_{f(x)}\,dx.$$


Figure 2.2b illustrates the definite integral $\int_1^{2.7} f(x)\,dx$ to scale, with uniform representative intervals dx.




Figure 2.2 (a) Schematic identification of significant features of IBP. (b) To scale: the original 
integral can be reconsidered as (c) an integral of U dV; the areas are equal. U and V are parametric 
functions of x; dV is a function of x and dx. As shown, when the dx are uniform, the dV are not. 


This integral is not immediate, so we can try integration by parts, though there is no guarantee that it will 
work. In this example, there are three ways of choosing U(x) and V(x): 


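$$\begin{aligned}
U(x) &= x\sin x, & dV &= dx & &\Rightarrow & dU &= (x\cos x + \sin x)\,dx, & V(x) &= x \\
U(x) &= x, & dV &= \sin x\,dx & &\Rightarrow & dU &= dx, & V(x) &= -\cos x \\
U(x) &= \sin x, & dV &= x\,dx & &\Rightarrow & dU &= \cos x\,dx, & V(x) &= x^2/2
\end{aligned}$$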


More complicated integrals will have more choices for U(x) and V(x). It is hard to know ahead of time which 
choice (or choices) will succeed. However, looking at the RHS of (2.3), we see that it multiplies V and the 
derivative of U. Looking at our 3 choices above, on the RHS of the arrows, we find the two factors V dU 
that we would be faced with integrating: 


• the first choice has an ugly dU, and V dU cannot be easily integrated;
• the second choice has dU = dx, which literally could not be simpler, and V dU integrates easily;
• the last choice has dU = cos x dx, which isn’t bad, but V dU cannot be easily integrated.

Thus our best guess is the second choice (often, the simplest dU is a good choice). Figure 2.2c illustrates
∫ U dV to scale; U and V are parametric functions of x; dV is a function of x and dx. Then:


$$\int \underbrace{x}_{U}\,\underbrace{\sin x\,dx}_{dV} = \underbrace{-x\cos x}_{UV} - \int \underbrace{-\cos x}_{V}\,\underbrace{dx}_{dU} = -x\cos x + \sin x.$$


We check by differentiating the RHS above, which yields the original integrand. 
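
The same check can be made numerically on the definite integral of Figure 2.2b. This is a sketch under stated assumptions: the Simpson's-rule helper and its step count are choices made for this example, and U, V follow the second choice above. It compares the original integral, the IBP form (2.3), and the antiderivative evaluated between the limits.

```python
import math

def simpson(f, a, b, n=1000):
    """Composite Simpson's rule with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

a, b = 1.0, 2.7                  # limits used in Figure 2.2b
U = lambda x: x                  # second choice: U = x, so dU = dx
V = lambda x: -math.cos(x)       # V = -cos x, so dV = sin x dx

lhs = simpson(lambda x: x * math.sin(x), a, b)   # integral of U dV
boundary = U(b) * V(b) - U(a) * V(a)             # boundary term [UV] from a to b
v_du = simpson(V, a, b)                          # integral of V dU
antiderivative = lambda x: -x * math.cos(x) + math.sin(x)

# All three printed numbers should agree to high precision.
print(lhs, boundary - v_du, antiderivative(b) - antiderivative(a))
```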


Note that when the dx in Figure 2.2b are uniform, the dV in Figure 2.2c are not. However, all the dV go 
to zero when the dx do, so the integral of U dV is still valid. 


The term $\big[U(x)V(x)\big]_{x=a}^{b}$ is called the “boundary term,” or sometimes the “surface term.”



Figure 2.3 Two more cases of integration by parts: (a) V(x) decreasing to 0. (b) V(x) progressing 
from zero, to finite, and back to zero. 


More advanced cases of Integration By Parts: Figure 2.3a illustrates another common case: one in
which the boundary term UV is zero. In this example, UV = 0 at x = a because U(a) = 0, and at x = b because
V(b) = 0. This means V(x) decreases as x increases. Viewed as ∫ U dV, all the dV < 0. The shaded “area”
is therefore negative. Viewed (sideways) as ∫ V dU, all the dU > 0 and the shaded area is positive. Thus:


$$\int_a^b f(x)\,dx = \int U\,dV = -\int V\,dU \qquad \text{when } \big[UV\big]_{x=a}^{b} = 0,$$


in agreement with (2.3). 


Figure 2.3b shows the case where UV = 0 at x = a and b, because one of U(x) or V(x) starts and ends at
0. For illustration, we chose V(a) = V(b) = 0. Then the boundary term is zero, and we again have:


$$\big[U(x)V(x)\big]_{x=a}^{b} = 0 \;\Rightarrow\; \int_{x=a}^{b} U\,dV = -\int_{x=a}^{b} V\,dU.$$

For V(x) to start and end at zero, V(x) must grow with x to some maximum, V_max, and then decrease back to
0. For simplicity, we assume U(x) is always increasing. The V dU integral is the blue striped area to the left
of the curve, and is > 0. The U dV integral is the area under the curves. We break the U dV integral into two
parts: path 1, leading up to V_max, and path 2, going back down from V_max to zero. The integral from 0 to V_max
(path 1) is the red striped area; the integral from V_max back down to 0 (path 2) is the negative of the entire
(blue + red) striped area. Then the blue shaded region is the difference (< 0):


(1) the (red) area below path 1 (where dV is positive, because V(x) is increasing), minus
(2) the (blue + red) area below path 2, where dV is negative because V(x) is decreasing. Thus
∫ U dV < 0:

$$\underbrace{\int U\,dV}_{\text{path 1 + path 2}} = \underbrace{\int_{V=0}^{V_{max}} U\,dV}_{\text{path 1}} + \underbrace{\int_{V=V_{max}}^{0} U\,dV}_{\text{path 2}} = \underbrace{\int_{V=0}^{V_{max}} U\,dV}_{\text{path 1}} - \underbrace{\int_{V=0}^{V_{max}} U\,dV}_{\text{path 2}} = -\int_{x=a}^{b} V\,dU.$$
Theoretical Importance of IBP


Besides being an integration tool, an important theoretical consequence of IBP is that the variable of 
integration is changed, from dV to dU. Many times, one differential is unknown, but the other is known: 


The classic example of this is deriving the Euler-Lagrange equations of motion from the principle of 
stationary action. The action of a dynamic system is defined by: 


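$$S \equiv \int L\big(q(t), \dot q(t)\big)\,dt,$$


where the lagrangian is a given function of the trajectory q(t). Stationary action means that the action does
not change (to first order) for small changes in the trajectory. I.e., given a small variation in the trajectory,
δq(t):


$$\delta S = 0 = \int L(q + \delta q,\ \dot q + \delta\dot q)\,dt - S = \int \left[\frac{\partial L}{\partial q}\,\delta q + \frac{\partial L}{\partial \dot q}\,\delta\dot q\right] dt.$$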


The quantity in brackets involves both δq(t) and its time derivative, δq̇(t). We are free to vary δq(t)
arbitrarily, but that fully determines δq̇(t). We cannot vary both δq and δq̇ separately. We also know that
δq(t) = 0 at its endpoints, but δq̇(t) is unconstrained at its endpoints. Therefore, it would be simpler if the
quantity in brackets were written entirely in terms of δq(t), and not in terms of δq̇. This is easy:


$$\text{Use: } \delta\dot q = \frac{d}{dt}\,\delta q \;\Rightarrow\; \delta S = 0 = \int \left[\frac{\partial L}{\partial q}\,\delta q + \frac{\partial L}{\partial \dot q}\,\frac{d}{dt}\,\delta q\right] dt.$$


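Now in the second term, IBP allows us to eliminate the time derivative of δq(t) (which is unconstrained)
in favor of the time derivative of ∂L/∂q̇ (which we can easily find, since L(q, q̇) is given). Therefore, this
is a good trade. Integrating the 2nd term in brackets by parts gives:


$$\text{Let } U = \frac{\partial L}{\partial \dot q}, \quad dV = \left(\frac{d}{dt}\,\delta q\right) dt \;\Rightarrow\; dU = \left(\frac{d}{dt}\frac{\partial L}{\partial \dot q}\right) dt, \quad V = \delta q$$

$$\int \frac{\partial L}{\partial \dot q}\,\frac{d}{dt}\,\delta q\ dt = UV - \int V\,dU = \left[\frac{\partial L}{\partial \dot q}\,\delta q(t)\right]_{\text{endpoints}} - \int \delta q\,\frac{d}{dt}\frac{\partial L}{\partial \dot q}\ dt.$$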

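The boundary term is zero because δq(t) is zero at both endpoints. The variation in action δS is now:


$$\delta S = \int \left[\frac{\partial L}{\partial q} - \frac{d}{dt}\frac{\partial L}{\partial \dot q}\right] \delta q\ dt = 0 \qquad \forall\ \delta q(t).$$


The only way δS = 0 can be satisfied for any δq(t) is if the quantity in brackets is identically 0. Thus IBP has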
led us to an important theoretical conclusion: the Euler-Lagrange equation of motion. 


This fundamental result has nothing to do with evaluating a specific difficult integral. IBP: it’s not just 
for hard integrals any more. 
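
As a concrete instance, consider a harmonic oscillator. This lagrangian, with mass m and spring constant k, is an assumption chosen purely for illustration and is not part of the derivation above. Setting the bracketed quantity to zero gives the familiar equation of motion:

```latex
L(q, \dot q) = \tfrac{1}{2} m \dot q^2 - \tfrac{1}{2} k q^2
\quad\Rightarrow\quad
\frac{\partial L}{\partial q} = -kq, \qquad
\frac{\partial L}{\partial \dot q} = m \dot q

\frac{\partial L}{\partial q} - \frac{d}{dt}\frac{\partial L}{\partial \dot q}
= -kq - m\ddot q = 0
\quad\Rightarrow\quad
m \ddot q = -kq
```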
Delta Function Surprise: Coordinates Matter 


Rarely, one needs to consider the 3D δ-function in coordinates other than rectangular. The coordinate-free
3D δ-function is written δ³(r − r'). For example, in 3D Green functions, whose definition depends on a
δ³-function, it may be convenient to use cylindrical or spherical coordinates. In these cases, there are some
unexpected consequences [Wyl p280]. This section assumes you understand the basic principle of a 1D and
3D δ-function. (See the introduction to the delta function in Quirky Quantum Concepts.)


Recall the defining property of δ³(r − r'):

$$\int_\infty d^3r\ \delta^3(\mathbf r - \mathbf r') = 1 \quad \forall\,\mathbf r' \quad (\forall \equiv \text{“for all”}) \;\Rightarrow\; \int_\infty d^3r\ \delta^3(\mathbf r - \mathbf r')\,f(\mathbf r) = f(\mathbf r').$$


The above definition is “coordinate free,” i.e. it makes no reference to any choice of coordinates, and is true
in every coordinate system. As with Green functions, it is often helpful to think of the δ-function as a function
of r, which is zero everywhere except for an impulse located at r'. As we will see, this means that it is
properly a function of r and r' separately, and should be written as δ³(r, r') (like Green functions are).


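Rectangular coordinates: In rectangular coordinates, however, we now show that we can simply break
up δ³(x, y, z) into 3 components. By writing (r − r') in rectangular coordinates, and using the defining integral
above, we get:


$$\mathbf r - \mathbf r' = (x - x',\ y - y',\ z - z') \;\Rightarrow\; \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy \int_{-\infty}^{\infty} dz\ \delta^3(x - x',\ y - y',\ z - z') = 1$$

$$\Rightarrow\; \delta^3(x - x',\ y - y',\ z - z') = \delta(x - x')\,\delta(y - y')\,\delta(z - z').$$

In rectangular coordinates, the above shows that we do have translation invariance, so we can simply write:

$$\delta^3(x, y, z) = \delta(x)\,\delta(y)\,\delta(z).$$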


In other coordinates, we do not have translation invariance. Recall the 3D infinitesimal volume element
in 4 different systems: coordinate-free, rectangular, cylindrical, and spherical coordinates:


$$d^3r = dx\,dy\,dz = r\,dr\,d\phi\,dz = r^2 \sin\theta\ dr\,d\theta\,d\phi.$$


The presence of r and θ implies that when writing the 3D δ-function in non-rectangular coordinates, we must
include a pre-factor to maintain the defining integral = 1. We now show this explicitly.


Cylindrical coordinates: In cylindrical coordinates, for r > 0, we have (using the imprecise notation of
[Wyl p280]):


$$\mathbf r - \mathbf r' = (r - r',\ \phi - \phi',\ z - z') \;\Rightarrow\; \int_0^{\infty} dr \int_0^{2\pi} d\phi \int_{-\infty}^{\infty} dz\ r\,\delta^3(r - r',\ \phi - \phi',\ z - z') = 1$$

$$\Rightarrow\; \delta^3(r - r',\ \phi - \phi',\ z - z') = \frac{1}{r'}\,\delta(r - r')\,\delta(\phi - \phi')\,\delta(z - z'), \qquad r' > 0.$$


Note the 1/r' pre-factor on the RHS. This may seem unexpected, because the pre-factor depends on the
location of δ³( ) in space (hence, no radial translation invariance). The rectangular coordinate version of δ³( )
has no such pre-factor. Properly speaking, δ³( ) isn’t a function of r − r'; it is a function of r and r' separately.
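
The need for the 1/r' pre-factor can be seen numerically by replacing each 1D δ-function with a narrow, unit-area Gaussian. This is a sketch under stated assumptions: the Gaussian stand-in, its width sigma, the grid size, and the function names are all choices made for this illustration, and the z factor is omitted because it behaves exactly as in rectangular coordinates.

```python
import math

def narrow_gaussian(u, sigma):
    """Stand-in for a 1D delta function: unit area, width sigma."""
    return math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def polar_integral(r0, phi0, prefactor, sigma=0.01, n=200):
    """Integrate r dr dphi * prefactor(r0) * delta(r - r0) * delta(phi - phi0).

    Midpoint rule over a +/- 6 sigma window around (r0, phi0); requires r0 > 6 sigma.
    """
    half = 6 * sigma
    step = 2 * half / n
    total = 0.0
    for i in range(n):
        r = r0 - half + (i + 0.5) * step
        for j in range(n):
            phi = phi0 - half + (j + 0.5) * step
            total += (r * prefactor(r0)
                      * narrow_gaussian(r - r0, sigma)
                      * narrow_gaussian(phi - phi0, sigma)) * step * step
    return total

for r0 in (0.5, 2.0):
    without = polar_integral(r0, 1.0, lambda rp: 1.0)           # no pre-factor
    with_factor = polar_integral(r0, 1.0, lambda rp: 1.0 / rp)  # 1/r' pre-factor
    print(r0, without, with_factor)
```

Without the pre-factor the integral comes out close to r', not 1; with it, the defining integral is restored.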


At r' = 0, the pre-factor blows up, so we need a different pre-factor. We’d like the defining integral to
be 1, regardless of φ, since all values of φ are equivalent at the origin. This means we must drop the
δ(φ − φ'), and replace the pre-factor to cancel the constant we get when we integrate out φ:

$$\int_0^{\infty} dr \int_0^{2\pi} d\phi \int_{-\infty}^{\infty} dz\ r\,\delta^3(r - r',\ \phi - \phi',\ z - z') = 1, \qquad r' = 0$$

$$\Rightarrow\; \delta^3(r - r',\ \phi - \phi',\ z - z') = \frac{1}{2\pi r}\,\delta(r)\,\delta(z - z'), \qquad r' = 0,$$

$$\text{assuming that } \int_0^{\infty} dr\ \delta(r) = 1.$$


This last assumption is somewhat unusual, because the δ-function is usually thought of as symmetric about
0, where the above radial integral would only be ½. The assumption implies a “right-sided” δ-function,
whose entire non-zero part is located at 0⁺. Furthermore, notice the factor of 1/r in
δ³(r − 0, z − z'). This factor blows up at r = 0, and has no effect when r ≠ 0. Nonetheless, it is needed because
the volume element r dr dφ dz goes to zero as r → 0, and the 1/r in δ³(r − 0, z − z') compensates for that.


Spherical coordinates: In spherical coordinates, we have similar considerations. First, away from the
origin, r' > 0:


$$\int_0^{\infty} dr \int_0^{\pi} d\theta \int_0^{2\pi} d\phi\ r^2 \sin\theta\ \delta^3(r - r',\ \theta - \theta',\ \phi - \phi') = 1 \;\Rightarrow$$

$$\delta^3(r - r',\ \theta - \theta',\ \phi - \phi') = \frac{1}{r'^2 \sin\theta'}\,\delta(r - r')\,\delta(\theta - \theta')\,\delta(\phi - \phi'), \qquad r' > 0. \qquad \text{[Wyl 8.9.2 p280]}$$


Again, the pre-factor depends on the position in space, and properly speaking, δ³( ) is a function of r, r', θ,
and θ' separately, not simply a function of r − r' and θ − θ'. At the origin, we’d like the defining integral to
be 1, regardless of φ or θ. So we drop the δ(φ − φ') δ(θ − θ'), and replace the pre-factor to cancel the constant
we get when we integrate out φ and θ:


$$\int_0^{\infty} dr \int_0^{\pi} d\theta \int_0^{2\pi} d\phi\ r^2 \sin\theta\ \delta^3(r - 0,\ \theta - \theta',\ \phi - \phi') = 1, \qquad r' = 0$$

$$\Rightarrow\; \delta^3(r - 0,\ \theta - \theta',\ \phi - \phi') = \frac{1}{4\pi r^2}\,\delta(r), \qquad r' = 0,$$

$$\text{assuming that } \int_0^{\infty} dr\ \delta(r) = 1.$$


Again, this definition uses the modified δ(r), whose entire non-zero part is located at 0⁺. And similar to the
cylindrical case, this includes the 1/r² factor to preserve the integral at r = 0.


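2D angular coordinates: For 2D angular coordinates θ and φ, we have:


$$\int_0^{\pi} d\theta \int_0^{2\pi} d\phi\ \sin\theta\ \delta^2(\theta - \theta',\ \phi - \phi') = 1, \qquad \theta' > 0$$

$$\Rightarrow\; \delta^2(\theta - \theta',\ \phi - \phi') = \frac{1}{\sin\theta'}\,\delta(\theta - \theta')\,\delta(\phi - \phi'), \qquad \theta' > 0.$$


Once again, we have a special case when θ' = 0: we must have the defining integral be 1 for any value of φ.
Hence, we again compensate for the 2π from the φ integral:


$$\int_0^{\pi} d\theta \int_0^{2\pi} d\phi\ \sin\theta\ \delta^2(\theta - \theta',\ \phi - \phi') = 1, \qquad \theta' = 0$$

$$\Rightarrow\; \delta^2(\theta - 0,\ \phi - \phi') = \frac{1}{2\pi \sin\theta}\,\delta(\theta), \qquad \theta' = 0.$$


Similar to the cylindrical and spherical cases, this includes a 1/(sin θ) factor to preserve the integral at θ = 0.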
Spherical Harmonics Are Not Harmonics


See Funky Electromagnetic Concepts for a full discussion of harmonics, Laplace’s equation, and its 
solutions in 1, 2, and 3 dimensions. Here is a brief overview. 


Spherical harmonics are the angular parts of solid harmonics, but we will show that they are not truly 
“harmonics.” A harmonic is a function which satisfies Laplace’s equation: 


$$\nabla^2 \Phi(\mathbf r) = 0, \qquad \text{with } \mathbf r \text{ typically in 2 or 3 dimensions.}$$


Solid harmonics are 3D harmonics: they solve Laplace’s equation in 3 dimensions. For example, one 
form of solid harmonics separates into a product of 3 functions in spherical coordinates: 


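$$\Phi(r, \theta, \phi) = R(r)\,P(\theta)\,Q(\phi) = \left(A_l r^l + B_l r^{-(l+1)}\right) P_{lm}(\cos\theta) \left(C_m \sin m\phi + D_m \cos m\phi\right)$$


where  $R(r) = A_l r^l + B_l r^{-(l+1)}$ is the radial part,
       $P(\theta) = P_{lm}(\cos\theta)$ is the polar angle part, the associated Legendre functions,
       $Q(\phi) = (C_m \sin m\phi + D_m \cos m\phi)$ is the azimuthal part.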


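The spherical harmonics are just the angular (θ, φ) parts of these solid harmonics. But notice that the
angular part alone does not satisfy the 2D Laplace equation (i.e., on a sphere of fixed radius):


$$\nabla^2 = \frac{1}{r^2}\frac{\partial}{\partial r}\left(r^2 \frac{\partial}{\partial r}\right) + \frac{1}{r^2 \sin\theta}\frac{\partial}{\partial\theta}\left(\sin\theta\,\frac{\partial}{\partial\theta}\right) + \frac{1}{r^2 \sin^2\theta}\frac{\partial^2}{\partial\phi^2}, \qquad \text{but for fixed } r:$$

$$\nabla^2 = \frac{1}{r^2}\left[\frac{1}{\sin\theta}\frac{\partial}{\partial\theta}\left(\sin\theta\,\frac{\partial}{\partial\theta}\right) + \frac{1}{\sin^2\theta}\frac{\partial^2}{\partial\phi^2}\right].$$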


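Direct substitution of spherical harmonics into the above Laplace operator shows that the result is
not 0 (we let r = 1). We proceed in small steps:


$$Q(\phi) = C \sin m\phi + D \cos m\phi \;\Rightarrow\; \frac{\partial^2}{\partial\phi^2}\,Q(\phi) = -m^2\,Q(\phi).$$

For integer m, the associated Legendre functions, P_lm(cos θ), satisfy, for given l and m:

$$\frac{1}{\sin\theta}\frac{\partial}{\partial\theta}\left(\sin\theta\,\frac{\partial}{\partial\theta}\,P_{lm}(\cos\theta)\right) = \left(-l(l+1) + \frac{m^2}{\sin^2\theta}\right) P_{lm}(\cos\theta).$$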


Combining these 2 results (r = 1): 


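$$\nabla^2\big(P(\theta)\,Q(\phi)\big) = \left[\frac{1}{\sin\theta}\frac{\partial}{\partial\theta}\left(\sin\theta\,\frac{\partial}{\partial\theta}\right) + \frac{1}{\sin^2\theta}\frac{\partial^2}{\partial\phi^2}\right] P(\theta)\,Q(\phi)$$

$$= \left(-l(l+1) + \frac{m^2}{\sin^2\theta}\right) P_{lm}(\cos\theta)\,Q(\phi) - \frac{m^2}{\sin^2\theta}\,P_{lm}(\cos\theta)\,Q(\phi)$$

$$= -l(l+1)\,P_{lm}(\cos\theta)\,Q(\phi).$$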
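
This result can be spot-checked with finite differences. The sketch below is an illustration under stated assumptions: the unnormalized l = 2, m = 1 function sin θ cos θ cos φ, the sample point, and the step size h are all choices made for this example. It applies the r = 1 Laplace operator numerically and compares the ratio to −l(l + 1).

```python
import math

def Y(theta, phi):
    # Unnormalized l = 2, m = 1 angular function: P_21(cos theta) ~ sin(theta) cos(theta).
    return math.sin(theta) * math.cos(theta) * math.cos(phi)

def angular_laplacian(f, theta, phi, h=1e-4):
    """Central-difference estimate of the r = 1 Laplace operator on the sphere."""
    def flux(t):
        # sin(theta) * df/dtheta, evaluated at polar angle t
        return math.sin(t) * (f(t + h, phi) - f(t - h, phi)) / (2 * h)
    theta_part = (flux(theta + h) - flux(theta - h)) / (2 * h * math.sin(theta))
    phi_part = (f(theta, phi + h) - 2 * f(theta, phi) + f(theta, phi - h)) / (
        h * h * math.sin(theta) ** 2)
    return theta_part + phi_part

l = 2
theta, phi = 0.9, 0.4  # arbitrary sample point away from the poles
ratio = angular_laplacian(Y, theta, phi) / Y(theta, phi)
print(ratio, -l * (l + 1))  # the ratio should be close to -l(l+1), not 0
```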


The Binomial Theorem for Negative and Fractional Exponents


You may be familiar with the binomial theorem for positive integer exponents, but it is very useful to
know that the binomial theorem also works for negative and fractional exponents. We can use this fact to
easily find series expansions for things like $\frac{1}{1-x}$ and $\sqrt{1+x} = (1+x)^{1/2}$.

First, let’s review the simple case of positive integer exponents:

$$(a+b)^n = a^n b^0 + \frac{n}{1}a^{n-1}b^1 + \frac{n(n-1)}{1\cdot 2}a^{n-2}b^2 + \frac{n(n-1)(n-2)}{1\cdot 2\cdot 3}a^{n-3}b^3 + \dots + \frac{n!}{n!}a^0 b^n.$$

[For completeness, we note that we can write the general form of the mth term:

$$m\text{th term} = \frac{n!}{(n-m)!\,m!}\,a^{n-m}b^m, \qquad n \text{ integer} > 0;\ m \text{ integer},\ 0 \le m \le n.]$$


But we’re much more interested in the iterative procedure (recursion relation) for finding the (m + 1)th term
from the mth term, because we use that to generate a power series expansion. The process is this:


1. The first term (m = 0) is always $a^n b^0 = a^n$, with an implicit coefficient $C_0 = 1$.
2. To find $C_{m+1}$, multiply $C_m$ by the power of a in the mth term, (n − m),
3. divide it by (m + 1) [the number of the new term we’re finding]: $C_{m+1} = \frac{n-m}{m+1}\,C_m$,
4. lower the power of a by 1 (to n − m − 1), and
5. raise the power of b by 1 (to m + 1).


This procedure is valid for all n, even negative and fractional n. A simple way to remember this is: 


For any real n, we generate the (m + 1)th term from the mth term
by differentiating with respect to a, and integrating with respect to b.


The general expansion, for any n, is then:


$$m\text{th term} = \frac{n(n-1)(n-2)\cdots(n-m+1)}{m!}\,a^{n-m}b^m, \qquad n \text{ real};\ m \text{ integer} \ge 0.$$

Notice that for integer n > 0, there are n+1 terms. For fractional or negative n, we get an infinite series.


Example 1: Find the Taylor series expansion of $\frac{1}{1-x}$. Since the Taylor series is unique, any method
we use to find a power series expansion will give us the Taylor series. So we can use the binomial theorem,
and apply the rules above, with a = 1, b = (−x):


$$\frac{1}{1-x} = (1+(-x))^{-1} = 1^{-1} + \frac{(-1)}{1}1^{-2}(-x)^1 + \frac{(-1)(-2)}{1\cdot 2}1^{-3}(-x)^2 + \frac{(-1)(-2)(-3)}{1\cdot 2\cdot 3}1^{-4}(-x)^3 + \dots$$


$$= 1 + x + x^2 + x^3 + \dots$$


Notice that all the fractions, all the powers of 1, and all the minus signs cancel. 


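Example 2: Find the Taylor series expansion of $\sqrt{1+x} = (1+x)^{1/2}$. The first term is $a^{1/2} = 1^{1/2}$:

$$(1+x)^{1/2} = 1^{1/2} + \frac{1/2}{1}1^{-1/2}x^1 + \frac{\frac12\left(-\frac12\right)}{1\cdot 2}1^{-3/2}x^2 + \frac{\frac12\left(-\frac12\right)\left(-\frac32\right)}{1\cdot 2\cdot 3}1^{-5/2}x^3 + \dots$$


$$= 1 + \frac{1}{2}x - \frac{1}{8}x^2 + \frac{3}{48}x^3 - \dots + (-1)^{m+1}\frac{(2m-3)!!}{2^m\,m!}x^m + \dots$$


where $p!! \equiv p(p-2)(p-4)\cdots(2 \text{ or } 1)$.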
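The recursion for the coefficients is easy to mechanize. The sketch below is illustrative only (the function name `binomial_coefficients` is an assumption, not from the text); it uses exact rational arithmetic to apply "multiply by (n − m), divide by (m + 1)" starting from C_0 = 1.

```python
from fractions import Fraction

def binomial_coefficients(n, count):
    """First `count` coefficients C_m in (a + b)^n = sum_m C_m a^(n-m) b^m.

    Works for any rational n (negative or fractional included), using
    C_0 = 1 and C_(m+1) = C_m * (n - m) / (m + 1).
    """
    n = Fraction(n)
    coeffs = [Fraction(1)]
    for m in range(count - 1):
        coeffs.append(coeffs[-1] * (n - m) / (m + 1))
    return coeffs

# sqrt(1 + x): coefficients 1, 1/2, -1/8, 1/16 (= 3/48)
print(binomial_coefficients(Fraction(1, 2), 4))
# 1/(1 - x) with b = -x: C_m = (-1)^m, so C_m * (-x)^m = x^m for every m
print(binomial_coefficients(-1, 4))
# Positive integer n = 3: the series terminates after n + 1 terms
print(binomial_coefficients(3, 5))
```

For positive integer n the factor (n − m) eventually hits zero, which is why those expansions are finite; for other n it never does, giving an infinite series.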


When Does a Divergent Series Converge? 


Sometimes, a divergent series “converges.” Consider the infinite series: 


$$1 + x + x^2 + x^3 + \dots$$


When is it convergent? Apparently, when |x| < 1. What is the value of the series when x = 2? “Undefined!”
you say. But there is a very important sense in which the series does converge for x = 2, and its value is
−1! How so?


Recall the Taylor expansion around x = 0 (you can use the binomial theorem, see earlier section):


$$(1-x)^{-1} = \frac{1}{1-x} = 1 + x + x^2 + x^3 + \dots$$

This is exactly the original infinite series above. So the series sums to 1/(1 − x). This expression is defined
for all x ≠ 1. And its value for x = 2 is −1.
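As a small numerical illustration (a sketch with assumed function names, not from the original), compare partial sums of the series with the closed form 1/(1 − x), inside and outside the region of convergence:

```python
def partial_sum(x, terms):
    """Sum of the first `terms` terms of 1 + x + x^2 + x^3 + ..."""
    return sum(x ** k for k in range(terms))

def closed_form(x):
    """The analytic expression that the series represents for |x| < 1."""
    return 1 / (1 - x)

# Inside the region of convergence the two agree:
print(partial_sum(0.5, 60), closed_form(0.5))
# Outside it, the partial sums grow without bound, but the
# analytic expression is still perfectly well defined:
print(partial_sum(2, 10), closed_form(2))
```

The partial sums at x = 2 never approach −1; the value −1 belongs to the analytic expression, not to any limit of partial sums.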


[Figure: the complex plane, with real and imaginary axes, showing the region of convergence.]


Figure 2.4 Domain of 1/(1 —x) in the complex plane. The function is analytically continued around 
the pole at x = 1, which defines meaningful values of the function even when x is outside the region 
of convergence. 


Why is this important? There are cases in physics when we use perturbation theory to find an expansion 
of a number (or function, as in QFT) in an infinite series. Sometimes, the series appears to diverge. But by 
finding the analytic expression corresponding to the series, we can evaluate that analytic expression at values 
of x that make the series diverge. In many cases, the analytic expression provides an important and 
meaningful answer to a perturbation problem even outside the original region of convergence. This happens 
in quantum mechanics, and quantum field theory (e.g., [M&S 2010 p291t]). 


This is an example of analytic continuation in complex analysis. Figure 2.4 illustrates the domain of
our function 1/(1 − x) in the complex plane. A Taylor series is a special case of a Laurent series, and anywhere
a function has a Laurent expansion it is analytic. If we know the Laurent series (or if we know the values of
an analytic function and all its derivatives at any one point), then we know the function everywhere, even for
complex values of x. Here, the original series is analytic around x = 0, with a radius of convergence of 1.
The process of extending a function that is defined in some region to be defined in a larger
(complex) region is called analytic continuation (see Complex Analysis, discussed elsewhere in this
document). This gives our function meaningful values for all x ≠ 1, such as x = 2. Thus analytic continuation
through the complex plane allows us to “hop over” the pole on the real axis, and define the function for real
x > 1 (and for x < −1).
TBS: show that the sum of the integers 1 + 2 + 3 + ... = −1/12.??


Algebra Family Tree 


group
    Definition: Finite or infinite set of elements and operator (·), with closure, associativity, identity
    element and inverses. Possibly commutative: a·b = c, with a, b, c group elements.
    Examples: rotations of a square by n × 90°; continuous rotations of an object.

ring
    Definition: Set of elements and 2 binary operators (+ and *), with:
    * commutative group under +
    * left and right distributivity: a(b + c) = ab + ac, (a + b)c = ac + bc
    * usually multiplicative associativity
    Examples: integers mod m; polynomials p(x) mod m(x).

integral domain, or domain
    Definition: A ring, with:
    * commutative multiplication
    * multiplicative identity (but no inverses)
    * no zero divisors (i.e., cancellation is valid): ab = 0 only if a = 0 or b = 0
    Examples: integers; polynomials, even abstract polynomials, with abstract variable x, and coefficients
    from a “field”.

field
    Definition: “rings with multiplicative inverses (& identity)”
    * commutative group under addition
    * commutative group (excluding 0) under multiplication
    * distributivity, multiplicative inverses
    Allows solving simultaneous linear equations. Field can be finite or infinite.
    Examples: integers with arithmetic modulo 3 (or any prime); real numbers; complex numbers.

vector space
    Definition:
    * field of scalars
    * group of vectors under +
    Allows solving simultaneous vector equations for unknown scalars or vectors.
    Finite or infinite dimensional.
    Examples: physical vectors; real or complex functions of space: f(x, y, z); kets (and bras).

Hilbert space
    Definition: vector space over field of complex numbers with:
    * a conjugate-bilinear inner product, <av|bw> = (a*)b<v|w>, <v|w> = <w|v>*,
      with a, b scalars, and v, w vectors
    * Mathematicians require it to be infinite dimensional; physicists don’t.
    Examples: real or complex functions of space: f(x, y, z); quantum mechanical wave functions.


Convoluted Thinking


Convolution arises in many physics, engineering, statistics, and other mathematical areas. As examples,
we here consider functions of time, but the concept of convolution may apply to functions of space, or
anything else. Given two functions, f(t) and g(t), the convolution of f(t) and g(t) is a function of a
time-displacement, Δt, defined by (Figure 2.5):


$$(f * g)(\Delta t) \equiv \int_{-\infty}^{\infty} dt\, f(t)\,g(\Delta t - t), \quad\text{where the integral covers some domain of interest.}$$
[Figure: panels (a) through (d), showing f(t) and g(t) at displacements Δt₀, Δt₁, and Δt₂.]

Figure 2.5 (a) Two functions, f(t) and g(t). (b) (f*g)(Δt₀), Δt₀ < 0. (c) (f*g)(Δt₁), Δt₁ > 0.
(d) (f*g)(Δt₂), Δt₂ > Δt₁. The convolution is the magenta shaded area.


When At < 0, the two functions are “backing into each other” (above left). When At > 0, the two functions 
are “backing away from each other” (above middle and right). 


As noted at the beginning, convolution is useful with a variety of independent variables besides time. 
E.g., for functions of space, f(x) and g(x), (f*g)(Δx) is a function of spatial displacement, Δx.


Notice that convolution is
(1) commutative: f*g = g*f
(2) linear in each of the two functions:
f*(kg) = k(f*g) = (kf)*g, and
f*(g + h) = f*g + f*h.
The verb “to convolve” means “to form the convolution of.” We convolve f and g to form the convolution
f*g. Some references use “⊗” for convolution: f ⊗ g.
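A discrete analogue makes these properties concrete. The following is a minimal sketch under the assumption that f and g are sampled on a uniform grid and are zero outside their samples; the function name `convolve` and the sample values are illustrative.

```python
def convolve(f, g):
    """Discrete analogue of (f*g)(dt) = sum over t of f(t) g(dt - t)."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj      # f at index i meets g at index (i + j) - i
    return out

f = [1.0, 2.0, 3.0]
g = [0.0, 1.0, 0.5]
h = [2.0, 0.0, 1.0]

print(convolve(f, g))                  # [0.0, 1.0, 2.5, 4.0, 1.5]
print(convolve(g, f))                  # same list: commutative
# Linear in the second function: f*(g + h) == f*g + f*h
g_plus_h = [a + b for a, b in zip(g, h)]
lhs = convolve(f, g_plus_h)
rhs = [a + b for a, b in zip(convolve(f, g), convolve(f, h))]
print(lhs == rhs)
```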


Two Dimensional Convolution: Impulsive Behavior 


A translation invariant linear system (TILS) is completely described by its impulse response. For 
example, for small angles, equivalent to narrow fields of view, an optical imaging system is approximately a 
TILS. In optics, the impulse response is called the Point Spread Function, or PSF. To illustrate the use of 
convolution in a TILS, consider an optical imager (Figure 2.6). 


[Figure: (a) object, optical imager, and image; (b) image plane with three point-source images and a
representative image point (x, y).]

Figure 2.6 (a) Optical imager is a TILS. (b) Example image of 3 point sources, with a representative
image point. Each source is spread out by the imager according to the PSF. The red arrow is the
vector (x − u).


The imager has finite resolution, so a point object is spread over a region in the image. For a point object 
at the origin with intensity O, the image has intensity distributed over space according to: 


$$I(x, y) = O \cdot PSF(x, y).$$
x and y are position coordinates, in units such as meters or microradians. We define the object coordinates
(u, v) to be those of their image points (so we can ignore magnification). Then translation invariance says
that for a point object at (u, v):


$$I(x, y) = O \cdot PSF(x - u, y - v).$$


For incoherent sources, intensities add, so multiple point sources produce an image intensity that is the sum 
of the individual images (Figure 2.6b). Therefore, the PSF is real, and represents the intensity response 
function (rather than field amplitude). At each point on the image (x, y): 


$$I(x, y) = O_1 PSF(x - u_1, y - v_1) + O_2 PSF(x - u_2, y - v_2) + O_3 PSF(x - u_3, y - v_3) = \sum_{i=1}^{3} O_i\, PSF(x - u_i, y - v_i).$$


For a continuous object, each infinitesimal region of size (du, dv) around each point (u, v) is essentially a 
point source. The image is the infinite sum of images of all these “point” sources. Then the sum above 
becomes a continuous integral: 


$$I(x, y) = \iint_{\text{object}} du\, dv\; O(u, v)\, PSF(x - u, y - v) \equiv O \otimes PSF. \qquad (2.4)$$

This is the definition of a 2D convolution. Some references use “*” for convolution: O*PSF.


In general, for a TILS: 


A convolution is an infinite sum of responses to a continuous input. 


Translation invariant linear systems are fully described by their impulse response (aka PSF). The 
output of such a system is the convolution of the input with the PSF. 


All of the above is true for arbitrary PSF, symmetric or not. Some systems exhibit symmetry, e.g. many 
optical systems are axially symmetric. In such a symmetric case, the arguments to the PSF may be negated, 
though we find such expressions misleading. 


For coherent systems, the PSF is generally complex, and it denotes the magnitude and phase of the light 
at the image relative to the object. Such a PSF represents the field amplitude response function (rather than 
intensity). 


In vector notation, the convolution (2.4) can be written: 


$$I(\mathbf{x}) = \iint_{\text{object}} d^2u\; O(\mathbf{u})\, PSF(\mathbf{x} - \mathbf{u}) \equiv O \otimes PSF.$$
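The point-source sum can be sketched on a pixel grid. Everything below is a hypothetical illustration: the deliberately asymmetric `toy_psf`, the source list, and the grid size are assumptions chosen so that the sign convention PSF(x − u, y − v) is visible.

```python
def toy_psf(dx, dy):
    """Hypothetical asymmetric PSF on a pixel grid; its values sum to 1."""
    return {(0, 0): 0.5, (1, 0): 0.25, (0, 1): 0.25}.get((dx, dy), 0.0)

def image_from_points(sources, psf, width, height):
    """I(x, y) = sum_i O_i * PSF(x - u_i, y - v_i) for sources (u_i, v_i, O_i)."""
    img = [[0.0] * width for _ in range(height)]
    for u, v, intensity in sources:
        for y in range(height):
            for x in range(width):
                img[y][x] += intensity * psf(x - u, y - v)
    return img

sources = [(1, 1, 4.0), (3, 2, 8.0)]       # (u, v, O): two incoherent point sources
img = image_from_points(sources, toy_psf, width=5, height=4)
for row in img:
    print(row)
```

Each source is smeared toward +x and +y of its own position, following PSF(x − u, y − v); negating the arguments would smear it the other way, which is why the argument order matters for an asymmetric PSF.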


Structure Functions 


The term “correlation” has two distinct meanings, both of which are used in astronomy: (1) correlation 
between random variables, and (2) correlation between functions (of space or of time). In both meanings, 
correlations are used to compare two things. For example, we might compare light, as a function of time, at 
point A in space with that at point B, i.e. I_A(t) compared to I_B(t). If these intensities vary randomly in time,
we might ask, how are the two related? 


Correlations between random variables: The correlation of two random variables (RVs) describes
to what extent the two RVs are linearly related to each other. The correlation is quantified with a correlation
coefficient ρ, where ρ = 1 means the two RVs are perfectly linearly related (e.g., identical). ρ is proportional
to the covariance of the RVs. Two uncorrelated RVs have no linear relationship (though they may be related
in other ways), and ρ = 0. (See Funky Mathematical Physics Concepts for details.)


In many systems, there are an infinite number of RVs, one at each point in space. For example, above a
telescope, at each atmospheric space point x, there may be a randomly-varying temperature T(t, x), index of
refraction n(t, x), or optical phase φ(t, x). The variations are over time. It is common that there are
correlations between the RVs at different points in space. For two very nearby points, ρ is near 1: the two
RVs are almost identical. For far separations, ρ is near 0, because the two RVs are essentially unrelated. In
general, at two points x₁ and x₂, and near some time t₀, using optical phase as an example, the two-point
structure function is [Quirr eq. 1]:


$$D_\phi(\mathbf{x}_1, \mathbf{x}_2) \equiv \left\langle \left|\phi(\mathbf{x}_1) - \phi(\mathbf{x}_2)\right|^2 \right\rangle = \frac{1}{T}\int_{t_0}^{t_0+T} dt\, \left|\phi(t, \mathbf{x}_1) - \phi(t, \mathbf{x}_2)\right|^2, \qquad T \sim \text{seconds}.$$


The averaging duration is of the order of the exposure time, typically some seconds [Fried 1966 Sec I-IV].
The weather typically changes much more slowly, of order at least minutes. For translation invariant, isotropic
systems, the above depends only on the spatial distance r ≡ |x₁ − x₂|. This defines a structure function of a
single variable, the distance r:


$$D_\phi(r) \equiv \left\langle \left|\phi(\mathbf{x}_1) - \phi(\mathbf{x}_1 + \mathbf{r})\right|^2 \right\rangle, \qquad T \sim \text{seconds}.$$


Since the system is translation invariant, D_φ can be evaluated at any choice of x₁. Because the system is
isotropic, D_φ can be evaluated at any vector r such that |r| = r.
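For sampled data, the time average becomes a mean over samples. The sketch below is illustrative only: the function name and the phase samples are made up, and it assumes both points are sampled at the same instants.

```python
def structure_function(phase_1, phase_2):
    """Time-averaged squared phase difference between two points.

    phase_1, phase_2: equal-length sequences of phase samples (radians),
    taken at the same instants at points x1 and x2.
    """
    if len(phase_1) != len(phase_2):
        raise ValueError("both points need the same number of time samples")
    diffs = [(a - b) ** 2 for a, b in zip(phase_1, phase_2)]
    return sum(diffs) / len(diffs)

nearby = structure_function([0.5, -0.25, 1.0], [0.5, -0.25, 1.0])   # identical phases
far = structure_function([1.0, -1.0, 1.0, -1.0], [-1.0, 1.0, -1.0, 1.0])
print(nearby, far)   # 0.0 4.0
```

Identical phase histories give D = 0; histories that differ give a positive D, growing with how different the two points are.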


Correlation Functions 


The correlation between two functions is a measure of how linearly related they are. The functions are 
often functions of time, or functions of space. A measure of their linear relationship is given by the integral 
of their product: 


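$$C_{fg} \equiv \int_{-\infty}^{\infty} dt\, f(t)\,g(t) \qquad\text{or}\qquad C_{fg} \equiv \iiint d^3x\, f(\mathbf{x})\,g(\mathbf{x}).$$


It is often useful to compare the two functions with some offset in one of them. Then the correlation is a
function of this offset:


$$C_{fg}(\tau) \equiv \int_{-\infty}^{\infty} dt\, f(t + \tau)\,g(t) \qquad\text{or}\qquad C_{fg}(\mathbf{r}) \equiv \iiint d^3x\, f(\mathbf{x} + \mathbf{r})\,g(\mathbf{x}).$$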


For two unrelated zero-mean functions, the correlation function is zero. 


It is frequently useful to compute the correlation of a function with an offset version of itself, called the
autocorrelation function. For example, at a fixed instant in time, consider the temperature variations
throughout the 3D atmosphere, T(x). Then:


$$C_{TT}(\mathbf{r}) \equiv \iiint d^3x\, T(\mathbf{x} + \mathbf{r})\,T(\mathbf{x}).$$


We expect that nearby temperatures are similar, and that distant temperatures are unrelated. Since T(x) is
zero-mean, we expect the autocorrelation function to be large for small offset, and small for large offset. The
distance at which the autocorrelation becomes small is a measure of the size of atmospheric volumes with
similar temperature. A 2D or higher autocorrelation function is not necessarily isotropic. For example, the
temperature may vary differently in the vertical direction than in horizontal ones.
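A one-dimensional discrete version of the correlation function might look like the sketch below. The function name, the decision to sum only over the overlap of the two sample sets, and the sample values are assumptions made for illustration.

```python
def correlation(f, g, lag):
    """Discrete C_fg(lag) = sum over t of f(t + lag) * g(t), over the overlap only."""
    return sum(f[t + lag] * g[t] for t in range(len(g)) if 0 <= t + lag < len(f))

# A made-up zero-mean sequence of temperature variations
temps = [1, 2, 1, -1, -2, -1]
print(correlation(temps, temps, 0))   # 12: autocorrelation at zero offset
print(correlation(temps, temps, 1))   # 7: smaller at a nonzero offset
```

With lag = 0 the autocorrelation is just the sum of squares, which is its largest value; real atmospheric data would need far more samples for the large-offset values to settle near zero.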
3 Vectors


Small Changes to Vectors 


Projection of a Small Change to a Vector Onto the Vector


[Figure: (left) a vector r with a small change dr, where dr ≡ d|r|; (right) vectors r and r′, with
|r − r′| ≈ r − r′·r̂.]

(Left) A small change to a vector, and its projection onto the vector.
(Right) Approximate magnitude of the difference between a big and small vector.


It is sometimes useful (in orbital mechanics, for example) to relate the change in a vector to the change
in the vector’s magnitude. The diagram above (left) leads to a somewhat unexpected result:


$$d\mathbf{r}\cdot\hat{\mathbf{r}} = dr \quad\text{or (multiplying both sides by } r \text{ and using } \mathbf{r} = r\hat{\mathbf{r}}\text{)}\quad \mathbf{r}\cdot d\mathbf{r} = r\,dr.$$

And since this is true for any small change, it is also true for any rate of change (just divide by dt):


$$\mathbf{r}\cdot\dot{\mathbf{r}} = r\,\dot{r}.$$


Vector Difference Approximation 


It is sometimes useful to approximate the magnitude of a large vector minus a small one. (In
electromagnetics, for example, this is used to compute the far-field from a small charge or current
distribution.) The diagram above (right) shows that:


$$|\mathbf{r} - \mathbf{r}'| \approx |\mathbf{r}| - \mathbf{r}'\cdot\hat{\mathbf{r}}, \qquad |\mathbf{r}| \gg |\mathbf{r}'|.$$
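Both small-change relations in this section can be checked with a few lines of arithmetic. This is a sketch with arbitrarily chosen vectors (assumptions, not values from the text); `norm` and `dot` are small helper functions defined here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

# (1) Change in magnitude equals the projection of dr onto r-hat: dr . r_hat = d|r|
r = (3.0, 4.0, 12.0)                        # |r| = 13
dr = (1e-6, -2e-6, 3e-6)                    # a small change
r_new = tuple(a + b for a, b in zip(r, dr))
change_in_magnitude = norm(r_new) - norm(r)
projection = dot(dr, r) / norm(r)
print(change_in_magnitude, projection)

# (2) Far-field approximation: |r - r'| ~ |r| - r' . r_hat when |r| >> |r'|
big = (300.0, 400.0, 1200.0)                # |big| = 1300
small = (1.0, 2.0, -1.0)
exact = norm(tuple(a - b for a, b in zip(big, small)))
approx = norm(big) - dot(small, big) / norm(big)
print(exact, approx)
```

In (2) the two values differ by roughly |r′|²/(2|r|), which is the size of the first term neglected by the approximation.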


Why (r, θ, φ) Are Not the Components of a Vector


(r, θ, φ) are parameters of a vector, but not components. That is, the parameters (r, θ, φ) uniquely define
the vector, but they are not components, because you can’t add them. This is important in much physics, e.g.
involving magnetic dipoles (ref Jac problem on mag dipole field). Components of a vector are defined as
coefficients of basis vectors. For example, the components v = (x, y, z) can multiply the basis vectors to
construct v:


$$\mathbf{v} = x\hat{\mathbf{x}} + y\hat{\mathbf{y}} + z\hat{\mathbf{z}}.$$


There is no similar equation we can write to construct v from its spherical parameters (r, θ, φ). Position
vectors are displacements from the origin, and there are no $\hat{\mathbf{r}}, \hat{\boldsymbol\theta}, \hat{\boldsymbol\phi}$ defined at the origin.

Put another way, you can always add the components of two vectors to get the vector sum:

Let w = (a, b, c) rectangular components. Then $\mathbf{v} + \mathbf{w} = (a + x)\hat{\mathbf{x}} + (b + y)\hat{\mathbf{y}} + (c + z)\hat{\mathbf{z}}$.

We can’t do this in spherical coordinates:


Let $\mathbf{w} = (r_w, \theta_w, \phi_w)$ spherical components. Then $\mathbf{v} + \mathbf{w} \ne (r_v + r_w,\ \theta_v + \theta_w,\ \phi_v + \phi_w)$.


However, at a point off the origin, the basis vectors $\hat{\mathbf{r}}, \hat{\boldsymbol\theta}, \hat{\boldsymbol\phi}$ are well defined, and can be used as a basis
for general vectors. [In differential geometry, vectors referenced to a point in space are called tangent
vectors, because they are “tangent” to the space, in a higher dimensional sense. See Differential Geometry
elsewhere in this document.]
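A quick numerical counterexample shows why the spherical parameters cannot simply be added. This is a sketch; the helper `to_cartesian` and the two example vectors are illustrative choices, using the physics convention (θ is the polar angle).

```python
import math

def to_cartesian(r, theta, phi):
    """Convert spherical parameters (r, polar angle, azimuth) to (x, y, z)."""
    return (r * math.sin(theta) * math.cos(phi),
            r * math.sin(theta) * math.sin(phi),
            r * math.cos(theta))

v = (1.0, math.pi / 2, 0.0)              # unit vector along x
w = (1.0, math.pi / 2, math.pi / 2)      # unit vector along y

# Correct sum: add rectangular components
true_sum = tuple(a + b for a, b in zip(to_cartesian(*v), to_cartesian(*w)))
# Incorrect "sum": add the spherical parameters, then convert
naive_sum = to_cartesian(*(a + b for a, b in zip(v, w)))

print(true_sum)     # approximately (1, 1, 0)
print(naive_sum)    # approximately (0, 0, -2): a completely different vector
```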


Laplacian’s Place 


What is the physical meaning of the Laplacian operator? And how can I remember the Laplacian 
operator in any coordinates? These questions are related because understanding the physical meaning allows 
you to quickly derive in your head the Laplacian operator in any of the common coordinates. 


Let’s take a step-by-step look at the action of the Laplacian, first in 1D, then on a 3D differential volume 
element, with physical examples at each step. After rectangular, we go to spherical coordinates, because they 
illustrate all the principles involved. Finally, we apply the concepts to cylindrical coordinates, as well. We 
follow this outline: 


1. Overview of the Laplacian operator 


2. 1D examples of heat flow 
3. 3D heat flow in rectangular coordinates 
4. Examples of physical scalar fields [temperature, pressure, electric potential (2 ways)] 
5. 3D differential volume elements in other coordinates 
6. Description of the physical meaning of Laplacian operator terms, such as
$\nabla^2 T,\ \ \frac{\partial T}{\partial r},\ \ r^2\frac{\partial T}{\partial r},\ \ \frac{\partial}{\partial r}\left(r^2\frac{\partial T}{\partial r}\right),\ \ \frac{1}{r^2}\frac{\partial}{\partial r}\left(r^2\frac{\partial T}{\partial r}\right)$.


Overview of Laplacian operator: Let the Laplacian act on a scalar field T(r), a physical function of
space, e.g. temperature. Usually, the Laplacian represents the net outflow per unit volume of some physical
quantity: something/volume, e.g., something/m³. The Laplacian operator itself involves spatial
second-derivatives, and so carries units of inverse area, say m⁻².


1D Example: Heat Flow: Consider a temperature gradient along a line. It could be a perpendicular 
wire through the wall of a refrigerator (Figure 3.1a). It is a 1D system, i.e. only the gradient along the wire 
matters. 


[Figure: (a) a metal wire and (b) a current carrying wire, each passing through the wall between the
refrigerator and the warmer room, with the heat flow direction and the temperature profile along the wire.]


Figure 3.1 Heat conduction (a) in a passive wire, and (b) in a heat-generating wire.


Let the left and right sides of the wire be in thermal equilibrium with the refrigerator and room, at 2 C 
and 27 C, respectively. The wire is passive, and can neither generate nor dissipate heat; it can only conduct 
it. Let the 1D thermal conductivity be k = 100 mW-cm/C. Consider the part of the wire inside the insulated 
wall, 4 cm thick. How much heat (power, J/s or W) flows through the wire? 


$$P = k\frac{dT}{dx} = (100\ \text{mW-cm/C})\,\frac{25\ \text{C}}{4\ \text{cm}} = 625\ \text{mW}.$$
dx 4cm 
There is no heat generated or dissipated in the wire, so the heat that flows into the right side of any
segment of the wire (differential or finite) must later flow out the left side. Thus, the heat flow must be
constant along the wire. Since heat flow is proportional to dT/dx, dT/dx must be constant, and the temperature
profile is linear. In other words, (1) since no heat is created or lost in the wire, heat-in = heat-out; (2) but
heat flow ~ dT/dx; so (3) the change in the temperature gradient is zero:


$$\frac{d}{dx}\left(\frac{dT}{dx}\right) = 0 = \frac{d^2T}{dx^2}.$$


(At the edges of the wall, the 1D approximation breaks down, and the inevitable nonlinearity of the 
temperature profile in the x direction is offset by heat flow out the sides of the wire.) 


Now consider a current carrying wire which generates heat all along its length from its resistance (Figure
3.1b). The heat that flows into the wire from the room is added to the heat generated in the wire, and the sum
of the two flows into the refrigerator. The heat generated in a length dx of wire is

$$P_{gen} = I^2\rho\,dx, \quad\text{where } \rho \equiv \text{resistance per unit length, and } I = \text{const}.$$


In steady state, the net outflow of heat from a segment of wire must equal the heat generated in that segment.
In an infinitesimal segment of length dx, we have heat-out = heat-in + heat-generated:


$$P_{out} = P_{in} + P_{gen} \quad\Rightarrow\quad k\left.\frac{dT}{dx}\right|_{a} = k\left.\frac{dT}{dx}\right|_{a+dx} + I^2\rho\,dx$$

$$k\left(\left.\frac{dT}{dx}\right|_{a+dx} - \left.\frac{dT}{dx}\right|_{a}\right) = -I^2\rho\,dx$$

$$k\frac{d}{dx}\left(\frac{dT}{dx}\right)dx = -I^2\rho\,dx \quad\Rightarrow\quad \frac{d^2T}{dx^2} = -\frac{I^2\rho}{k}.$$


The negative sign means that when the temperature gradient is positive (increasing to the right), the heat
flow is negative (to the left), i.e. the heat flow is opposite the gradient. Many physical systems have a similar
negative sign. Thus the 2nd derivative of the temperature is proportional to the negative of the heat outflow
(i.e., to the net inflow) from a segment, per unit length of the segment. Longer segments have more net
outflow (generate more heat).
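The two wire examples can be put side by side numerically. This sketch uses the conductivity, wall thickness, and temperatures from the text; the generation rate of 10 mW/cm for the heated wire is an assumed value chosen only for illustration, and the function names are hypothetical. It uses the steady-state profile with a constant second derivative, d²T/dx² = −(generated power per length)/k, fixed to 2 C at x = 0 and 27 C at x = 4 cm.

```python
K = 100.0                     # 1D thermal conductivity, mW-cm/C (from the text)
LENGTH = 4.0                  # wall thickness, cm (from the text)
T_COLD, T_WARM = 2.0, 27.0    # refrigerator and room temperatures, C (from the text)

def temperature_gradient(x, gen_per_cm):
    """dT/dx (C/cm) at position x for uniform heat generation gen_per_cm (mW/cm)."""
    s = gen_per_cm / K        # equals -d2T/dx2, constant along the wire
    return (T_WARM - T_COLD) / LENGTH + s * (LENGTH / 2 - x)

def heat_flow(x, gen_per_cm=0.0):
    """Heat flowing toward the refrigerator (leftward) at x, in mW."""
    return K * temperature_gradient(x, gen_per_cm)

# Passive wire: the same 625 mW flows through every cross-section
print(heat_flow(0.0), heat_flow(4.0))
# Heated wire (assumed 10 mW/cm): heat out the cold end exceeds heat in the warm end
out_cold = heat_flow(0.0, gen_per_cm=10.0)
in_warm = heat_flow(4.0, gen_per_cm=10.0)
print(out_cold, in_warm, out_cold - in_warm)   # difference is about 10 mW/cm * 4 cm
```

The difference between the two ends equals the total heat generated in the wire, which is the "heat-out = heat-in + heat-generated" balance applied to the whole segment.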


3D Rectangular Volume Element 


Now consider a 3D bulk resistive material, carrying some current. The current generates heat in each 
volume element of material. Consider the heat flow in the x direction, with this volume element: 


[Figure: rectangular volume element with sides dx, dy, dz and flow along x. The outflow surface area is the
same as the inflow area.]

The temperature gradient normal to the y-z face drives a heat flow per unit area, in W/m². For a net flow to
the right, the temperature gradient must be increasing in magnitude (becoming more negative) as we move
to the right. The change in gradient is proportional to dx, and the net heat outflow is proportional to the
area and to the change in gradient:


$$P_{out} - P_{in} = -k\frac{d^2T}{dx^2}\,dx\,dy\,dz \quad\Rightarrow\quad \frac{P_{out} - P_{in}}{dx\,dy\,dz} = -k\frac{d^2T}{dx^2}.$$


Thus the net heat outflow per unit volume, due to the x contribution, goes like the 2nd derivative of T.
Clearly, a similar argument applies to the y and z directions, each also contributing net heat outflow per unit
volume. Therefore, the total heat outflow per unit volume from all 3 directions is simply the sum of the heat
flows in each direction:


$$\frac{P_{out} - P_{in}}{dx\,dy\,dz} = -k\left(\frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} + \frac{\partial^2 T}{\partial z^2}\right).$$


We see that in all cases, the 


net outflow of flux per unit volume = change in (flux per unit area), per unit distance 


We will use this fact to derive the Laplacian operator in spherical and cylindrical coordinates. 


General Laplacian 


We now generalize. For the Laplacian to mean anything, it must act on a scalar field whose gradient 
drives a flow of some physical thing. 


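Example 1: My favorite is T(r) = temperature. Then ∇T(r) drives heat (energy) flow, heat per unit
time, per unit area:


$$\frac{\text{heat}/t}{\text{area}} \equiv \mathbf{q} = -k\nabla T(\mathbf{r}), \quad\text{where } k \equiv \text{thermal conductivity},\ \mathbf{q} \equiv \text{heat flow vector}.$$

Then $\frac{\partial T}{\partial r} \sim q_r$ = radial component of heat flow.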


Example 2: T(r) = pressure of an incompressible viscous fluid (e.g. honey). Then ∇T(r) drives fluid
mass (or volume) flow, mass per unit time, per unit area:


$$\frac{\text{mass}/t}{\text{area}} \equiv \mathbf{j} = -k\nabla T(\mathbf{r}), \quad\text{where } k \equiv \text{some viscous friction coefficient},\ \mathbf{j} \equiv \text{mass flow density vector}.$$

Then $\frac{\partial T}{\partial r} \sim j_r$ = radial component of mass flow.


Example 3: T(r) = electric potential in a resistive material. Then ∇T(r) drives charge flow, charge per
unit time, per unit area:


$$\frac{\text{charge}/t}{\text{area}} \equiv \mathbf{j} = -\sigma\nabla T(\mathbf{r}), \quad\text{where } \sigma \equiv \text{electrical conductivity},\ \mathbf{j} \equiv \text{current density vector}.$$

Then $\frac{\partial T}{\partial r} \sim j_r$ = radial component of current density.
r 


Example 4: Here we abstract a little more, to add meaning to the common equations of
electromagnetics. Let T(r) = electric potential in a vacuum. Then ∇T(r) measures the energy per unit
distance, per unit area, required to push a fixed charge density ρ through a surface, by a distance of dn, normal
to the surface:


$$\frac{\text{energy}/\text{distance}}{\text{area}} = \rho\nabla T(\mathbf{r}), \quad\text{where } \rho \equiv \text{electric charge volume density}.$$


Then ∂T/∂r ~ net energy per unit radius, per unit area, to push charges of density ρ out the same distance
through both surfaces.

In the first 3 examples, we use the word “flow” to mean the flow in time of some physical quantity, per
unit area. In the last example, the “flow” is energy expenditure per unit distance, per unit area. The
requirement of “per unit area” is essential, as we soon show.


Laplacian In Spherical Coordinates 


To understand the Laplacian operator terms in other coordinates, we need to take into account two 
effects: 


1. The outflow surface area may be different than the inflow surface area 


2. The derivatives with respect to angles (θ or φ) need to be converted to rate-of-change per unit
distance.


We’ll see how these two effects come into play as we develop the spherical terms of the Laplacian operator. 
The cylindrical terms are simplifications of the spherical terms. 


Spherical radial contribution: We first consider the radial contribution to the spherical Laplacian 
operator, from this volume element: 


[Figure: spherical volume element with radial flow; solid angle dΩ = sin θ dφ dθ. The outflow surface area is
differentially larger than the inflow area.]

The differential volume element has thickness dr, which can be made arbitrarily small compared to the
lengths of the sides. The inner surface of the element has area r² dΩ. The outer surface has infinitesimally
more area. Thus the radial contribution includes the “surface-area” effect, but not the
“converting-derivatives” effect.


The increased area of the outflow surface means that for the same flux-density (flow) on inner and outer 
surfaces, there would be a net outflow of flux, since flux = (flux-density)(area). Therefore, we must take the 
derivative of the flux itself, not the flux density, and then convert the result back to per-unit-volume. We do 
this in 3 steps: 


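$$\text{flux} = (\text{area})(\text{flux-density}) \sim (r^2 d\Omega)\frac{\partial T}{\partial r}$$

$$\frac{d(\text{flux})}{dr} = \frac{\partial}{\partial r}\left(r^2 d\Omega\,\frac{\partial T}{\partial r}\right)$$

$$\frac{\text{outflow}}{\text{volume}} = \frac{d(\text{flux})}{(\text{area})\,dr} = \frac{1}{r^2 d\Omega}\frac{\partial}{\partial r}\left(r^2 d\Omega\,\frac{\partial T}{\partial r}\right) = \frac{1}{r^2}\frac{\partial}{\partial r}\left(r^2\frac{\partial T}{\partial r}\right)$$


The constant dΩ factor from the area cancels when converting to flux, and back to flux-density. In other
words, we can think of the fluxes as per-steradian.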


We summarize the stages of the spherical radial Laplacian operator as follows: 
$$\nabla^2_r T(\mathbf{r}) = \frac{1}{r^2}\frac{\partial}{\partial r}\,r^2\,\frac{\partial}{\partial r}T(\mathbf{r})$$

$\frac{\partial}{\partial r}T$ = radial flux per unit area

$r^2\frac{\partial}{\partial r}T$ = radial flux, per unit solid-angle = (area)(flow per unit area)/dΩ

$\frac{\partial}{\partial r}\,r^2\frac{\partial}{\partial r}T$ = change in radial flux per unit length, per unit solid-angle; positive is increasing flux

$\frac{1}{r^2}\frac{\partial}{\partial r}\,r^2\frac{\partial}{\partial r}T$ = change in radial flux per unit length, per unit area
= net outflow of flux per unit volume

Reading the operator from the inside out, the nested stages are: radial flow per unit area; radial flux
per steradian; change in radial flux per unit length per steradian; change in radial flux per unit length,
per unit area.

Following the steps in the example of heat flow, let T(r) = temperature. Then 


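$\frac{\partial T}{\partial r}$ = radial heat flow per unit area, W/m²

$r^2\frac{\partial T}{\partial r}$ = radial heat flux, W/solid-angle = watts/steradian

$\frac{\partial}{\partial r}\,r^2\frac{\partial T}{\partial r}$ = change in radial heat flux per unit length, per unit solid-angle

$\frac{1}{r^2}\frac{\partial}{\partial r}\,r^2\frac{\partial T}{\partial r}$ = net outflow of heat flux per unit volume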
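The stages can be followed numerically. The sketch below is illustrative only: it builds the radial operator stage by stage with central differences (the step h and the test fields are assumptions) and applies it to two spherically symmetric fields, T = 1/r, which has no net outflow away from the origin, and T = r², which has a constant one.

```python
def radial_laplacian(T, r, h=1e-3):
    """(1/r^2) d/dr ( r^2 dT/dr ), built up in the stages described above."""
    def flux_per_steradian(s):
        flow_per_area = (T(s + h) - T(s - h)) / (2 * h)   # dT/dr
        return s * s * flow_per_area                       # r^2 dT/dr
    # change in (flux per steradian) per unit length
    change = (flux_per_steradian(r + h) - flux_per_steradian(r - h)) / (2 * h)
    return change / (r * r)                                # back to per unit area

print(radial_laplacian(lambda r: 1.0 / r, 2.0))   # close to 0
print(radial_laplacian(lambda r: r * r, 2.0))     # close to 6
```

For T = 1/r the flux per steradian, r² dT/dr = −1, is the same at every radius, so nothing accumulates; the growing area exactly compensates the falling flux density.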


Spherical azimuthal contribution: The spherical φ contribution to the Laplacian has no area-change,
but does require converting derivatives. Consider the volume element:


[Figure: spherical volume element with flow in the φ direction, through angle dφ. The outflow surface area is
identical to the inflow area.]

The inflow and outflow surface areas are the same, and therefore area-change contributes nothing to the
derivatives.

However, we must convert the derivatives with respect to φ into rates-of-change with respect to distance,
because physically, the flow is driven by a derivative with respect to distance. In the spherical φ case, the
effective radius for the arc-length along the flow is r sin θ, because we must project the position vector into
the plane of rotation. Thus, (∂/∂φ) is the rate-of-change per (r sin θ) meters. Therefore,

$$\text{rate-of-change-per-meter} = \frac{1}{r\sin\theta}\frac{\partial}{\partial\phi}.$$
Performing the two derivative conversions, we get


$$\nabla^2_\phi T(\mathbf{r}) = \frac{1}{r\sin\theta}\frac{\partial}{\partial\phi}\,\frac{1}{r\sin\theta}\frac{\partial}{\partial\phi}T$$

$\frac{1}{r\sin\theta}\frac{\partial}{\partial\phi}T$ = azimuthal flux per unit area

$\frac{\partial}{\partial\phi}\,\frac{1}{r\sin\theta}\frac{\partial}{\partial\phi}T$ = change in (azimuthal flux per unit area) per radian

$\frac{1}{r\sin\theta}\frac{\partial}{\partial\phi}\,\frac{1}{r\sin\theta}\frac{\partial}{\partial\phi}T$ = change in (azimuthal flux per unit area) per unit distance
= net azimuthal outflow of flux per unit volume

$$\frac{1}{r\sin\theta}\frac{\partial}{\partial\phi}\,\frac{1}{r\sin\theta}\frac{\partial}{\partial\phi}T = \frac{1}{r^2\sin^2\theta}\frac{\partial^2}{\partial\phi^2}T$$

Notice that the r² sin² θ in the denominator is not a physical area; it comes from two derivative
conversions.


Spherical polar angle contribution: 


[Figure: spherical volume element with flow in the θ direction. The outflow surface area is differentially
larger than the inflow area.]


The volume element is like a wedge of an orange: it gets wider (in the northern hemisphere) as θ
increases. Therefore the outflow area is differentially larger than the inflow area (in the northern
hemisphere). In particular, area = (r sin θ dφ) dr, but we only need to keep the θ dependence, because the
factors of r cancel, just like dΩ did in the spherical radial contribution. So we have

area ∝ sin θ.


In addition, we must convert the ∂/∂θ to a rate-of-change with distance. Thus the spherical polar angle
contribution has both area-change and derivative-conversion.


Following the steps of converting to flux, taking the derivative, then converting back to flux-density, we 
get 
$$\nabla^2_\theta T(\mathbf{r}) = \frac{1}{\sin\theta}\,\frac{1}{r}\frac{\partial}{\partial\theta}\,\sin\theta\,\frac{1}{r}\frac{\partial}{\partial\theta}T$$

$\frac{1}{r}\frac{\partial}{\partial\theta}T$ = θ-flux per unit area

$\sin\theta\,\frac{1}{r}\frac{\partial}{\partial\theta}T$ = θ-flux, per unit radius = (area)(flux per unit area)/dr

$\frac{\partial}{\partial\theta}\,\sin\theta\,\frac{1}{r}\frac{\partial}{\partial\theta}T$ = change in (θ-flux per unit radius), per radian

$\frac{1}{r}\frac{\partial}{\partial\theta}\,\sin\theta\,\frac{1}{r}\frac{\partial}{\partial\theta}T$ = change in (θ-flux per unit radius), per unit distance

$\frac{1}{\sin\theta}\,\frac{1}{r}\frac{\partial}{\partial\theta}\,\sin\theta\,\frac{1}{r}\frac{\partial}{\partial\theta}T$ = change in (θ-flux per unit area), per unit distance
= net outflow of flux per unit volume

$$\frac{1}{\sin\theta}\,\frac{1}{r}\frac{\partial}{\partial\theta}\,\sin\theta\,\frac{1}{r}\frac{\partial}{\partial\theta}T = \frac{1}{r^2\sin\theta}\frac{\partial}{\partial\theta}\,\sin\theta\,\frac{\partial}{\partial\theta}T$$


Notice that the r² in the denominator is not a physical area; it comes from two derivative conversions.
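As a consistency check on the three spherical contributions, one can add them numerically and compare with the rectangular Laplacian of the same field. The sketch below is an illustration under stated assumptions: the test field T = x²y + z³ (whose rectangular Laplacian is 2y + 6z), the sample point, the step size, and all function names are arbitrary choices, not from the text.

```python
import math

def T_cart(x, y, z):
    return x * x * y + z ** 3              # rectangular Laplacian: 2y + 6z

def T_sph(r, th, ph):
    """The same field, expressed in spherical parameters."""
    return T_cart(r * math.sin(th) * math.cos(ph),
                  r * math.sin(th) * math.sin(ph),
                  r * math.cos(th))

def d(f, x, h=1e-3):
    """Central-difference derivative of a one-variable function."""
    return (f(x + h) - f(x - h)) / (2 * h)

def spherical_laplacian(T, r, th, ph):
    # radial: (1/r^2) d/dr ( r^2 dT/dr )
    radial = d(lambda s: s * s * d(lambda u: T(u, th, ph), s), r) / (r * r)
    # polar: (1/(r^2 sin th)) d/dth ( sin th dT/dth )
    polar = d(lambda a: math.sin(a) * d(lambda b: T(r, b, ph), a), th)
    polar /= r * r * math.sin(th)
    # azimuthal: (1/(r sin th)^2) d2T/dph2
    azimuthal = d(lambda a: d(lambda b: T(r, th, b), a), ph) / (r * math.sin(th)) ** 2
    return radial + polar + azimuthal

r0, th0, ph0 = 1.5, 0.8, 0.4               # arbitrary point off the z-axis
y0 = r0 * math.sin(th0) * math.sin(ph0)
z0 = r0 * math.cos(th0)
print(spherical_laplacian(T_sph, r0, th0, ph0), 2 * y0 + 6 * z0)
```

The two printed numbers agree to within the finite-difference error, as they must if the area-change and derivative-conversion factors have been assigned correctly.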


Cylindrical Coordinates 
The cylindrical terms are simplifications of the spherical terms. 


Zz : and z outflow flow 
Radial outflow Y 4 
. surface areas are 
surface area is * : ; 
; identical to 


differentially larger 


than inflow flow ales dz =~ 
flow dr dd 


Cylindrical radial contribution: The picture of the cylindrical radial contribution is essentially the 
same as the spherical, but the “height” of the slab is exactly constant. We still face the issues of varying 
inflow and outflow surface areas, and converting derivatives to rate of change per unit distance. The change 
in area is due only to the arc length r dφ, with the z (height) fixed. Thus we write the radial result directly:

$$ \nabla^2_r T(\mathbf{r}) = \frac{1}{r}\frac{\partial}{\partial r}\left(r\,\frac{\partial T}{\partial r}\right) \qquad \text{(Cylindrical Coordinates)} $$

$$ \begin{aligned}
\frac{\partial T}{\partial r} &= \text{radial flow per unit area} \\
r\,\frac{\partial T}{\partial r} &= \text{radial flux per unit angle} \\
\frac{\partial}{\partial r}\left(r\,\frac{\partial T}{\partial r}\right) &= \text{change in (radial flux per unit angle), per unit radius} \\
\frac{1}{r}\frac{\partial}{\partial r}\left(r\,\frac{\partial T}{\partial r}\right) &= \text{change in (radial flux per unit area), per unit radius} \\
&= \text{net outflow of flux per unit volume} .
\end{aligned} $$

Cylindrical azimuthal contribution: Like the spherical case, the inflow and outflow surfaces have
identical areas. Therefore, the φ contribution is similar to the spherical case, except there is no sin θ factor;
r contributes directly to the arc-length and rate-of-change per unit distance:


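$$ \nabla^2_\phi T(\mathbf{r}) = \frac{1}{r}\frac{\partial}{\partial\phi}\left(\frac{1}{r}\frac{\partial T}{\partial\phi}\right) = \frac{1}{r^2}\frac{\partial^2 T}{\partial\phi^2} $$

$$ \begin{aligned}
\frac{1}{r}\frac{\partial T}{\partial\phi} &= \text{azimuthal flux per unit area} \\
\frac{\partial}{\partial\phi}\left(\frac{1}{r}\frac{\partial T}{\partial\phi}\right) &= \text{change in (azimuthal flux per unit area), per radian} \\
\frac{1}{r}\frac{\partial}{\partial\phi}\left(\frac{1}{r}\frac{\partial T}{\partial\phi}\right) &= \text{change in (azimuthal flux per unit area), per unit distance} \\
&= \text{net azimuthal outflow of flux per unit volume} .
\end{aligned} $$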


Cylindrical z contribution: This is identical to the rectangular case: the inflow and outflow areas are 
the same, and the derivative is already per unit distance, ergo: (add cylindrical volume element picture??) 

$$ \nabla^2_z T(\mathbf{r}) = \frac{\partial}{\partial z}\left(\frac{\partial T}{\partial z}\right) = \frac{\partial^2 T}{\partial z^2} $$

$$ \begin{aligned}
\frac{\partial T}{\partial z} &= \text{vertical flux per unit area} \\
\frac{\partial}{\partial z}\left(\frac{\partial T}{\partial z}\right) &= \text{change in (vertical flux per unit area), per unit distance} \\
&= \text{net outflow of flux per unit volume} .
\end{aligned} $$


Laplacian of a Vector Field 


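It gets worse: there’s a vector form of ∇². If E(x, y, z) is a vector field, then in rectangular coordinates:

$$ \nabla^2\mathbf{E} = \nabla\cdot\nabla\mathbf{E} = \nabla^2 E_x\,\mathbf{i} + \nabla^2 E_y\,\mathbf{j} + \nabla^2 E_z\,\mathbf{k} . $$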


This arises in E&M propagation, and not much in QM. However, the above equality is only true in 
rectangular coordinates [I have a ref for this, but lost it??]. This is the divergence of the gradient of a vector 
field, which is a vector. In oblique or non-normal coordinates, the gradient and divergence must be covariant, 
and include the Christoffel symbols. 


Vector Dot Grad Vector 


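In electromagnetic propagation, and elsewhere, one encounters the “dot product” of a vector field with
the gradient operator, acting on a vector field. What is this v·∇ operator? Here, v(r) is a given vector field.
The simple view is that v(r)·∇ is just a notational shorthand for

$$ \mathbf{v}(\mathbf{r})\cdot\nabla \equiv v_x\frac{\partial}{\partial x} + v_y\frac{\partial}{\partial y} + v_z\frac{\partial}{\partial z} , $$

$$ \text{because}\quad \mathbf{v}(\mathbf{r})\cdot\nabla = \left(v_x\hat{\mathbf{x}} + v_y\hat{\mathbf{y}} + v_z\hat{\mathbf{z}}\right)\cdot\left(\frac{\partial}{\partial x}\hat{\mathbf{x}} + \frac{\partial}{\partial y}\hat{\mathbf{y}} + \frac{\partial}{\partial z}\hat{\mathbf{z}}\right) = v_x\frac{\partial}{\partial x} + v_y\frac{\partial}{\partial y} + v_z\frac{\partial}{\partial z} $$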


by the usual rules for a dot product in rectangular coordinates. 


There is a deeper meaning, though, which is an important bridge to the topics of tensors and differential 
geometry. 


We can view the v·∇ operator as simply the dot product of the vector field v(r)
with the gradient of a vector field. 


You may think of the gradient operator as acting on a scalar field, to produce a vector field. But the 
gradient operator can also act on a vector field, to produce a tensor field. Here’s how it works: You are 
probably familiar with derivatives of a vector field: 
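
Let A(x, y, z) be a vector field. Then

$$ \frac{\partial\mathbf{A}}{\partial x} = \frac{\partial A_x}{\partial x}\hat{\mathbf{x}} + \frac{\partial A_y}{\partial x}\hat{\mathbf{y}} + \frac{\partial A_z}{\partial x}\hat{\mathbf{z}} $$

is a vector field. Writing spatial vectors as column vectors,

$$ \mathbf{A} = \begin{pmatrix} A_x \\ A_y \\ A_z \end{pmatrix}, \quad\text{and}\quad \frac{\partial\mathbf{A}}{\partial x} = \begin{pmatrix} \partial A_x/\partial x \\ \partial A_y/\partial x \\ \partial A_z/\partial x \end{pmatrix} . $$

Similarly, ∂A/∂y and ∂A/∂z are also vector fields. By the rule for total derivatives, for a small
displacement (dx, dy, dz),

$$ d\mathbf{A} = \begin{pmatrix} dA_x \\ dA_y \\ dA_z \end{pmatrix}
= \begin{pmatrix}
\dfrac{\partial A_x}{\partial x} & \dfrac{\partial A_x}{\partial y} & \dfrac{\partial A_x}{\partial z} \\
\dfrac{\partial A_y}{\partial x} & \dfrac{\partial A_y}{\partial y} & \dfrac{\partial A_y}{\partial z} \\
\dfrac{\partial A_z}{\partial x} & \dfrac{\partial A_z}{\partial y} & \dfrac{\partial A_z}{\partial z}
\end{pmatrix}
\begin{pmatrix} dx \\ dy \\ dz \end{pmatrix}
= \frac{\partial\mathbf{A}}{\partial x}\,dx + \frac{\partial\mathbf{A}}{\partial y}\,dy + \frac{\partial\mathbf{A}}{\partial z}\,dz . $$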


This says that the vector dA is a linear combination of 3 column vectors ∂A/∂x, ∂A/∂y, and ∂A/∂z, weighted
respectively by the displacements dx, dy, and dz. The 3 x 3 matrix above is the gradient of the vector field
A(r). It is the natural extension of the gradient (of a scalar field) to a vector field. It is a rank-2 tensor, which
means that given a vector (dx, dy, dz), it produces a vector (dA) which is a linear combination of 3 (column)
vectors (∇A), each weighted by the components of the given vector (dx, dy, dz).


Note that ∇A and ∇·A are very different: the former is a rank-2 tensor field, the latter is a scalar field.


This concept extends further to derivatives of rank-2 tensors, which are rank-3 tensors: 3 x 3 x 3 cubes 
of numbers, producing a linear combination of 3 x 3 arrays, weighted by the components of a given vector 
(dx, dy, dz). And so on. 


Note that in other coordinates (e.g., cylindrical or spherical), ∇A is not given by the derivative of its
components with respect to the 3 coordinates. The components interact, because the basis vectors also change 
through space. That leads to the subject of differential geometry, discussed elsewhere in this document. 
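
In rectangular coordinates, the gradient of a vector field and the v·∇ operator are easy to sketch numerically. The example below is illustrative only: the field A = (xy, yz, zx), the point r, the displacement dr, the vector v, and the helper names grad_vector_field and mat_vec are all our own assumptions, not from the text. It builds the 3 x 3 matrix of partial derivatives by central differences, then uses it twice: once to predict dA for a small displacement, and once to form (v·∇)A.

```python
def A(x, y, z):
    """Assumed vector field, chosen only for illustration."""
    return (x * y, y * z, z * x)

def grad_vector_field(A, r, h=1e-6):
    """3x3 matrix J[i][j] = dA_i/dx_j by central differences (rectangular coordinates)."""
    J = [[0.0] * 3 for _ in range(3)]
    for j in range(3):
        plus, minus = list(r), list(r)
        plus[j] += h
        minus[j] -= h
        A_plus, A_minus = A(*plus), A(*minus)
        for i in range(3):
            J[i][j] = (A_plus[i] - A_minus[i]) / (2 * h)
    return J

def mat_vec(J, v):
    """Linear combination of the columns of J, weighted by the components of v."""
    return [sum(J[i][j] * v[j] for j in range(3)) for i in range(3)]

r = (1.0, 2.0, 3.0)
J = grad_vector_field(A, r)

# dA for a small displacement: linear prediction versus the actual change in A
dr = (1e-3, -2e-3, 5e-4)
dA_linear = mat_vec(J, dr)
r_moved = tuple(ri + di for ri, di in zip(r, dr))
dA_actual = [a1 - a0 for a1, a0 in zip(A(*r_moved), A(*r))]

# (v . grad) A: the same matrix acting on a given vector v
v = (1.0, 0.0, 2.0)
v_dot_grad_A = mat_vec(J, v)
print(dA_linear, dA_actual, v_dot_grad_A)
```

This shortcut relies on the basis vectors being constant; in cylindrical or spherical coordinates the matrix of component derivatives alone is not ∇A.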

4 Green Functions


We follow [Jac p??] and [Bra] in using the term “Green function,” rather than “Green’s function.” 
Though we agree with Jackson’s logic, we do it mostly because it’s easier to say and type. 


Green functions are a big topic, with lots of subtopics. Many references describe only a subset, but use 
words that imply they are covering all of Green functions. If you are looking for a specific application of 
Green functions, such as electrostatics, you may want to skip right to that section, but the “big idea” applies 
to all Green functions. 


Though Green functions are used to solve linear operator equations (such as differential equations), the 
concepts involved apply to other applications, such as the Born approximation, impulse response analysis, 
and quantum propagators. 


The Big Idea 


Green functions are a method of solving linear operator equations (such as inhomogeneous linear 
differential equations) of the form: 


$$ \mathcal{L}\{f(x)\} = \underbrace{s(x)}_{\text{source}}, \quad\text{where } \mathcal{L}\{\ \} \text{ is a linear operator.} \qquad (4.1) $$


s(x) is called the “source” function. We use Green functions when other methods are hard, or to make a
useful approximation (the Born approximation). The big idea is to break up the source s(x) into infinitesimal
pieces (δ-functions), solve each piece separately, and add up the solutions. Since $\mathcal{L}$ is linear, the sum of
solutions is also a solution, and is the solution to the original problem.


Sometimes, the Green function itself can be given physical meaning, as in E&M where it is essentially 
Huygens’ Principle, but with accurate phase information, or in Quantum Field Theory where it is the
propagator of a quantized field. Green functions can generate particular (i.e. inhomogeneous) solutions, and 
solutions matching boundary conditions. They don’t generate homogeneous solutions (i.e., where the right 
hand side is zero). We explore Green functions through the following steps: 


1. Extremely brief review of the δ-function.
2. The tired, but inevitable, electromagnetic example.
3. Linear differential equations of one variable (1-dimensional), with sources.
4. Delta function expansions.
5. Green functions of two variables (but 1 dimension).
6. When you can collapse a Green function to one variable (“portable Green functions”: translational
   invariance).
7. Dealing with boundary conditions: at least 5 (6??) kinds of BC.
8. Green-like methods: the Born approximation.


You will find no references to “Green’s Theorem” or “self-adjoint” until we get to non-homogeneous 
boundary conditions, because until then, those topics are unnecessary and confusing. We will see that: 


The biggest hurdle in understanding Green functions is the boundary conditions. 




Some references derive Green functions from Green’s Theorem, which derives from Gauss’ Law. 
That is only a special case. In general, Green functions do not rely on Green’s Theorem. 


We return to this point later, after discussing general boundary conditions. 


Dirac Delta Function 


Recall that the Dirac δ-function is an “impulse,” an infinitely narrow, tall spike function, defined as:

$$ \delta(x) = 0 \ \text{ for } x \ne 0, \quad\text{and}\quad \int_{-a}^{a} \delta(x)\,dx = 1, \ \ \forall a > 0 \quad \text{(the area under the } \delta\text{-function is 1).} $$

(This also implies δ(0) → ∞, but we don’t focus on that here.) The linearity of integration implies the delta
function can be offset, and weighted, so that:

$$ \int_{b-a}^{b+a} w\,\delta(x-b)\,dx = w, \quad \forall a > 0 . $$

Since the δ-function is infinitely narrow, it can “pick out” a single value from a function:

$$ \int_{b-a}^{b+a} \delta(x-b)\,f(x)\,dx = f(b), \quad \forall a > 0 . \qquad (4.2) $$

This is called the “filtering property” of the δ-function. See Quirky Quantum Concepts for more on the delta
function. The units of δ( ) are [x]⁻¹, because δ(x) dx integrates to the pure number 1.
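
The filtering property can be illustrated numerically by standing in a very narrow, unit-area spike for the δ-function. This is a sketch only: the Gaussian stand-in, its width, the choices f = cos and b = 0.3, and the names narrow_gaussian and filtered_value are our assumptions for illustration.

```python
import math

def narrow_gaussian(x, width):
    """Unit-area Gaussian; a stand-in for delta(x) as width -> 0."""
    return math.exp(-0.5 * (x / width) ** 2) / (width * math.sqrt(2 * math.pi))

def filtered_value(f, b, a=1.0, width=1e-3, n=200_001):
    """Midpoint-rule estimate of the integral of delta(x - b) f(x) over [b - a, b + a]."""
    dx = 2 * a / n
    total = 0.0
    for i in range(n):
        x = b - a + (i + 0.5) * dx
        total += narrow_gaussian(x - b, width) * f(x) * dx
    return total

b = 0.3
approx = filtered_value(math.cos, b)
print(approx, math.cos(b))   # the spike "picks out" f(b)
```

Changing a (the half-width of the integration interval) makes no practical difference, as long as the interval contains the spike.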


The Tired, But Inevitable, Electromagnetic Example 


You probably have seen Poisson’s equation relating the electrostatic potential at a point to the charge 
distribution creating the potential (in gaussian units): 


$$ -\nabla^2\phi(\mathbf{r}) = 4\pi\rho(\mathbf{r}), \quad\text{where } \phi \equiv \text{electrostatic potential}, \ \ \rho \equiv \text{charge density}. \qquad (4.3) $$


We solved this by noting three things: (1a) electrostatic potential, φ, obeys “superposition:” the potential due
to multiple charges is the sum of the potentials of the individual charges; (1b) the potential is proportional to
the source charge; and (2) if we take the potential at infinity to be zero, the potential due to a point charge is:

$$ \phi(\mathbf{r}) = \frac{q}{|\mathbf{r}-\mathbf{r}'|} \quad \text{(point charge } q \text{ at } \mathbf{r}'\text{)}. \qquad (4.4) $$

(We say much more about boundary conditions later.) The properties (1a) and (1b) above, taken together, 
define a linear relationship: 


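$$ \begin{aligned}
\text{Given:}\quad & \rho_1(\mathbf{r}') \to \phi_1(\mathbf{r}), \ \text{ and } \ \rho_2(\mathbf{r}') \to \phi_2(\mathbf{r}), \\
\text{then:}\quad & a\rho_1(\mathbf{r}') + \rho_2(\mathbf{r}') \to \phi_{total}(\mathbf{r}) = a\phi_1(\mathbf{r}) + \phi_2(\mathbf{r}) .
\end{aligned} $$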


To solve (4.3), we break up the source charge distribution ρ(r) into an infinite number of little point
charges. The set of points is spread out over space, each of charge ρ(r) d³r. The solution for φ is the sum of
potentials from all the point charges, and the infinite sum is an integral, so we find φ as:

$$ \phi(\mathbf{r}) = \lim_{\#\,\text{points}\to\infty} \sum_i \rho(\mathbf{r}'_i)\,d^3r'\ \frac{1}{|\mathbf{r}-\mathbf{r}'_i|} = \int \rho(\mathbf{r}')\,d^3r'\ \frac{1}{|\mathbf{r}-\mathbf{r}'|} . $$

Note that the charge “distribution” for a point charge is a δ-function: infinite charge density, but finite total
charge. Also, φ(r) for a point charge at r' is translationally invariant: it has the same form for all r'. We will
remove this restriction later.


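All of this followed from simple mathematical properties of Eq. (4.3) that have nothing to do with
electromagnetics. All we used to solve for φ was that the left-hand side is a linear operator on φ (so
superposition applies), and we have a known solution when the right-hand side is a delta function at r':

$$ \underbrace{-\nabla^2}_{\text{linear operator}}\ \underbrace{\phi(\mathbf{r})}_{\text{unknown function}} = \underbrace{4\pi\rho(\mathbf{r})}_{\text{given ``source'' function}} \quad\text{and}\quad \underbrace{-\nabla^2}_{\text{linear operator}}\ \underbrace{\frac{1}{4\pi|\mathbf{r}-\mathbf{r}'|}}_{\text{known solution}} = \underbrace{\delta(\mathbf{r}-\mathbf{r}')}_{\text{given point ``source'' at } \mathbf{r}'} . $$
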
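Since any given ρ can be written as a sum of weighted δ-functions, the solution for that given ρ is a sum of
delta-function solutions. Now we generalize this electromagnetic example to arbitrary (for now, 1D) linear
operator equations by letting r → x, φ → f, −∇² → $\mathcal{L}$, ρ → s, and call the known δ-function solution G(x):

$$ \underbrace{-\nabla^2\phi(\mathbf{r})}_{\mathcal{L}\{f(x)\}} = \underbrace{4\pi\rho(\mathbf{r})}_{s(x)} \quad\text{and}\quad \underbrace{-\nabla^2}_{\mathcal{L}}\ \underbrace{\frac{1}{4\pi|\mathbf{r}-\mathbf{r}'|}}_{G(\mathbf{r}-\mathbf{r}')} = \underbrace{\delta(\mathbf{r}-\mathbf{r}')}_{\text{given point ``source'' at } \mathbf{r}'} \quad\Rightarrow $$

$$ \begin{aligned}
\text{Given}\quad & \mathcal{L}\{f(x)\} = s(x) \ \text{ and } \ \mathcal{L}\{G(x-x')\} = \delta(x-x'), \\
\text{then}\quad & f(x) = \int_{-\infty}^{\infty} dx'\ s(x')\,G(x-x') .
\end{aligned} $$

This assumes, as above, that our linear operator, and any boundary conditions, are translationally invariant.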


A Fresh, New Signal Processing Example 


If the following example doesn’t make sense to you, just skip it. Signal processing and control theory 
folk have long used a Green function-like concept, but with different words. A time-invariant linear system 
(TILS) produces an output which is a linear operation on its input: 


o(t) = M{i(t)} where M{ } is a linear operation taking input to output . 


In this case, we aren’t given M{ }, and we don’t solve for it (also it’s on the right-, rather than the left-side
of the equation). However, we are given a measurement (or computation) of the system’s impulse response,
called h(t). If you poke the system with a very short spike (i.e., if you feed an impulse into the system, i(t) =
δ(t)), the system responds with h(t):

$$ h(t) = M\{\delta(t)\}, \quad\text{where } h(t) \text{ is the system's impulse response.} $$

h(t) acts like a Green function, giving the system response at time t to a delta function at t = 0. Note that h(t)
is spread out over time, and usually of (theoretically) infinite duration. h(t) fully characterizes the system,
because we can express any input function as a series of impulses (with the delta-function expansion below),
and sum up all the responses. Therefore, we find the output for any input, i(t), with:

$$ o(t) = \int_{-\infty}^{\infty} i(t')\,h(t-t')\,dt' . $$

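The output integral is an ordinary convolution, which is simple to sketch. Everything specific here is an assumption for illustration: the impulse response h(t) = exp(−t) for t ≥ 0, the unit-step input, and the function names. For that pair, the output can be done by hand: o(t) = 1 − exp(−t) for t ≥ 0.

```python
import math

def h(t):
    """Assumed impulse response: exp(-t) for t >= 0, zero before the impulse."""
    return math.exp(-t) if t >= 0 else 0.0

def i_in(t):
    """Assumed input: a unit step switched on at t = 0."""
    return 1.0 if t >= 0 else 0.0

def output(t, n=20_000):
    """Midpoint-rule estimate of o(t) = integral of i(t') h(t - t') dt'."""
    # The integrand vanishes for t' < 0 (no input yet) and for t' > t (h is zero
    # before its impulse), so we only need to integrate over 0 < t' < t.
    if t <= 0:
        return 0.0
    dt = t / n
    return sum(i_in(tp) * h(t - tp) * dt
               for tp in ((k + 0.5) * dt for k in range(n)))

o2 = output(2.0)
print(o2, 1.0 - math.exp(-2.0))
```
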
Caution: many references do not distinguish between a Green function G(x) and an impulse response
h(x). The two are similar, but they differ because:

$$ \mathcal{L}\{G(x)\} = \delta(x), \quad\text{but}\quad h(x) = M\{\delta(x)\} . $$

The δ-function is in a different place for a Green function vs. an impulse response. For example, in
electromagnetics, sources (charges and currents) are the stimulus that results in fields (E and B). Maxwell’s
equations have linear operators acting on the result (fields) to give you the stimulus. A TILS does the reverse:
it produces a result which is a linear operation on its input (stimulus).

We can see a relationship between a Green function and an impulse response by taking M⁻¹ (if it exists)
of both sides of the second equation:

$$ M^{-1}\{h(x)\} = \delta(x) . $$

Thus the impulse response for an operator M is the Green function for the operator M⁻¹. In particular,
quantum field theory calls the field “propagator” a Green function, but it is more directly thought of as an
impulse response.


Linear differential equations of one variable, with sources


We wish to solve for f(x), given s(x):

$$ \mathcal{L}\{f(x)\} = s(x), \quad\text{where } \mathcal{L}\{\ \} \text{ is a linear operator, and } s(x) \text{ is called the ``source,'' or forcing function.} $$

$$ \text{E.g.,}\quad \left(\frac{d^2}{dx^2} + \omega^2\right) f(x) \equiv \frac{d^2}{dx^2}f(x) + \omega^2 f(x) = s(x) . $$


We ignore boundary conditions for now (to be dealt with later). The differential equations often have 3D 
space as their domain. Note that we are not differentiating s(x), which will be important when we get to the 
delta function expansion of s(x). 


Green functions solve the above equation by first solving a related equation: if we can find a function 
(i.e., a “Green function”) such that: 


$$ \mathcal{L}\{G(x)\} = \delta(x), \quad\text{where } \delta(x) \text{ is the Dirac delta function,} $$

$$ \text{e.g.,}\quad \left(\frac{d^2}{dx^2} + \omega^2\right) G(x) = \delta(x), $$

then we can use that Green function to solve our original equation. This might seem weird, because δ(0) → ∞,
but it just means that Green functions often have discontinuities in them or their derivatives. For example,
suppose G(x) is a step function:

$$ G(x) = \begin{cases} 0, & x < 0 \\ 1, & x > 0 \end{cases} \qquad\text{Then}\quad \frac{d}{dx}G(x) = \delta(x) . $$


Now suppose our source isn’t centered at the origin, i.e., s(x) = δ(x − a). If $\mathcal{L}\{\ \}$ is translation invariant
[along with any boundary conditions], then G( ) can still solve the equation by translation:

$$ \mathcal{L}\{f(x)\} = s(x) = \delta(x-a) \quad\Rightarrow\quad f(x) = G(x-a) \ \text{ is a solution.} $$

If s(x) is a weighted sum of delta functions at different places, then because $\mathcal{L}\{\ \}$ is linear, the solution is
immediate: we just add up the solutions from all the δ-functions:

$$ \mathcal{L}\{f(x)\} = s(x) = \sum_i w_i\,\delta(x-x_i) \quad\Rightarrow\quad f(x) = \sum_i w_i\,G(x-x_i) . $$


Usually the source s(x) is continuous. Then we can break up s(x) into infinitesimally small pieces (i.e., 
expand it as an infinite sum of delta functions, described in a moment), and sum the solutions for the pieces. 
The summation goes over to an integral, and a solution is: 


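$$ \mathcal{L}\{f(x)\} = s(x) = \sum_{i=1}^{\infty} w_i\,\delta(x-x_i) \quad\to $$

$$ \mathcal{L}\{f(x)\} = s(x) = \int_{-\infty}^{\infty} dx'\ s(x')\,\delta(x-x'), \quad\text{and}\quad f(x) = \int_{-\infty}^{\infty} dx'\ s(x')\,G(x-x') . $$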


We can show directly that f(x) is a solution of the original equation by plugging it in, and noting that
$\mathcal{L}\{\ \}$ acts in the x domain, and “goes through” (i.e., commutes with) any operation in x':

$$ \begin{aligned}
\mathcal{L}\{f(x)\} &= \mathcal{L}\left\{\int_{-\infty}^{\infty} dx'\ s(x')\,G(x-x')\right\} \\
&= \int_{-\infty}^{\infty} dx'\ s(x')\,\mathcal{L}\{G(x-x')\} && \text{moving } \mathcal{L}\{\ \} \text{ inside the integral} \\
&= \int_{-\infty}^{\infty} dx'\ s(x')\,\delta(x-x') = s(x) && \delta(\ ) \text{ picks out the value of } s(x)\text{. QED.}
\end{aligned} $$

We now digress for a moment to understand the δ-function expansion.
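
As a concrete check of f(x) = ∫ dx' s(x') G(x − x'), we can use the step-function Green function mentioned above, for which the operator is simply d/dx. This is a sketch under our own assumptions: the source cos(x) switched on at x = 0, the integration window [0, 10], and the function names. With those choices the integral should reproduce the antiderivative sin(x), whose derivative returns the source.

```python
import math

def G(x):
    """Step function: a Green function for the operator d/dx, since dG/dx = delta(x)."""
    return 1.0 if x > 0 else 0.0

def s(x):
    """Assumed source: cos(x), switched on at x = 0."""
    return math.cos(x) if x >= 0 else 0.0

def f(x, x_lo=0.0, x_hi=10.0, n=100_000):
    """Midpoint-rule estimate of f(x) = integral of s(x') G(x - x') dx'."""
    dx = (x_hi - x_lo) / n
    return sum(s(xp) * G(x - xp) * dx
               for xp in (x_lo + (k + 0.5) * dx for k in range(n)))

f_at_1 = f(1.0)
# Applying the operator d/dx to f should give back the source s
df_at_1 = (f(1.001) - f(0.999)) / 0.002
print(f_at_1, math.sin(1.0), df_at_1, s(1.0))
```
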


Delta Function Expansion 


As in the EM example, it is frequently quite useful to expand a given function s(x) as a sum of
δ-functions:

$$ s(x) \approx \sum_{i=1}^{N} w_i\,\delta(x-x_i), \quad\text{where } w_i \text{ are the weights of the basis delta functions.} $$

[This same expansion is used to characterize the response of linear systems to input i(t).]


[Figure: panel (a) shows s(x) with its δ-function approximations; panel (b) shows one δ-function at x_i,
covering an interval Δx, with weight w_i = area ≈ s(x_i)Δx.]

Figure 4.1 (a) Approximating a function with δ-functions. (b) The weight of each δ-function is
such that its integral approximates the integral of the given function, s(x), over the interval “covered”
by the δ-function.


In Figure 4.1a, we approximate s(x) first with N = 8 δ-functions (green), then with N = 16 δ-functions (red).
As we double N, the weight of each δ-function is roughly cut in half, but there are twice as many of them.
Hence, the integral of the δ-function approximation remains about the same. Of course, the approximation
gets better as N increases. As usual, we let the number of δ-functions go to infinity: N → ∞.


In Figure 4.1b, we show how to choose the weight of each δ-function: its weight is such that its integral
approximates the integral of the given function, s(x), over the interval “covered” by the δ-function. In the
limit of N → ∞, the approximation becomes arbitrarily good.


In what sense is the δ-function series an approximation to s(x)? Certainly, if we need the derivative s'(x),
the delta function expansion seems to be terrible. However, if we want the integral of s(x), or any integral
operator, such as an inner product or a convolution, then the delta function series is a good approximation.
Examples:

$$ \text{For } \int s(x)\,dx, \ \text{ or } \int f(x)\,s(x)\,dx, \ \text{ or } \int f(x'-x)\,s(x)\,dx, \quad\text{then}\quad s(x) \approx \sum_{i=1}^{N} w_i\,\delta(x-x_i), \ \text{ where } w_i = s(x_i)\,\Delta x . $$

As N → ∞, we expand s(x) in an infinite sum (an integral) of δ-functions:

$$ s(x) = \sum_i w_i\,\delta(x-x_i) \quad\to\quad s(x) = \int dx'\ s(x')\,\delta(x-x') \qquad \left(x_i \to x', \ \ \Delta x \to dx', \ \ w_i \to s(x')\,dx'\right), $$

which if you think about it, follows directly from the definition of δ(x), per (4.2).
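
The sense in which the δ-function series is a good approximation under an integral can be seen numerically. The sketch below makes assumptions of its own: the source s(x) = sin(x) on [0, π], the test function f(x) = x, sample points at the centers of the intervals, and the function names. The exact inner product ∫ x sin(x) dx over [0, π] is π.

```python
import math

def delta_weights(s, x_lo, x_hi, n):
    """Sample points x_i and weights w_i = s(x_i) * dx for an n-term delta expansion."""
    dx = (x_hi - x_lo) / n
    xs = [x_lo + (i + 0.5) * dx for i in range(n)]
    return xs, [s(x) * dx for x in xs]

def inner_product(f, xs, ws):
    """Integral of f(x) * sum_i w_i delta(x - x_i): each delta picks out f(x_i)."""
    return sum(w * f(x) for x, w in zip(xs, ws))

def f(x):
    return x   # assumed test function

errors = {}
for n in (8, 16, 1024):
    xs, ws = delta_weights(math.sin, 0.0, math.pi, n)
    errors[n] = abs(inner_product(f, xs, ws) - math.pi)
print(errors)   # the error shrinks as the number of delta functions grows
```

No derivative of s(x) appears anywhere in the computation, which is why the spiky expansion causes no trouble here.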


[Aside: Delta functions are a continuous set of orthonormal basis functions, much like sinusoids from quantum
mechanics and Fourier transforms. They satisfy all the usual orthonormal conditions for a continuous basis, i.e. they
are orthogonal and normalized:

$$ \int_{-\infty}^{\infty} dx\ \delta(x-a)\,\delta(x-b) = \delta(a-b) . $$
]

Note that in the final solution of the prior section, we integrate s(x) times other stuff:

$$ f(x) = \int_{-\infty}^{\infty} dx'\ s(x')\,G(x-x'), $$

and integrating over s(x) is what makes the δ-function expansion of s(x) valid.


[Aside: It turns out that even systems that differentiate s(x) can use the δ-function expansion, but we need not
bother with that here.]


Boundary Conditions on Green Functions 


Most problems require boundary conditions on the solution to an equation. 


Introduction to Boundary Conditions 


We now impose a simple boundary condition on an equation we seek to solve. Consider a 2D problem 
in the plane: 


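$$ \mathcal{L}\{f(x,y)\} = s(x,y) \ \text{ inside the boundary;} \qquad f(\text{boundary}) = 0, \ \text{ where the boundary is given.} $$

We define the vectors r ≡ (x, y) and r' ≡ (x', y'), and recall that:

$$ \delta^2(\mathbf{r}) \equiv \delta(x)\,\delta(y), \quad\text{so that}\quad \delta^2(\mathbf{r}-\mathbf{r}') = \delta(x-x')\,\delta(y-y') . $$

The boundary condition removes the translation invariance of the problem (Figure 4.2). The delta function
response of $\mathcal{L}\{G(\mathbf{r})\}$ translates, but the boundary condition does not. I.e., a solution of:

$$ \begin{aligned}
& \mathcal{L}\{G(\mathbf{r})\} = \delta(\mathbf{r}), \ \text{ and } \ G(\text{boundary}) = 0 \quad\Rightarrow\quad \mathcal{L}\{G(\mathbf{r}-\mathbf{r}')\} = \delta(\mathbf{r}-\mathbf{r}') \\
& \text{BUT does NOT} \ \Rightarrow \ G(\text{boundary}-\mathbf{r}') = 0 .
\end{aligned} $$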


[Figure: panel (a) shows the domain of f(x, y) in the x-y plane, with f(boundary) = 0; panel (b) shows that
the boundary condition remains fixed, and does not translate with r'.]

Figure 4.2 (a) The domain of interest (blue), and its boundary (red). (b) A solution meeting the
BC for the source at (0, 0) does not translate to another point r' and still meet the BC.


With boundary conditions, for each source point r', we need a different Green function!


The Green function for a source point r', call it $G_{\mathbf{r}'}(\mathbf{r})$, must satisfy both:

$$ \mathcal{L}\{G_{\mathbf{r}'}(\mathbf{r})\} = \delta(\mathbf{r}-\mathbf{r}') \quad\text{and}\quad G_{\mathbf{r}'}(\text{boundary}) = 0 . $$

We can think of this as a Green function of two arguments, r and r', but really, r is the argument, and r' is a
parameter. In other words, we have a family of Green functions, $G_{\mathbf{r}'}(\mathbf{r})$, each labeled by the location of the
source point, r'.


Note that finding 1D Green functions is an important prerequisite for 3D Green functions, because a 3D 
problem sometimes separates into a 2D and a 1D problem. We give such an example in the section on 3D 
Laplacian operator boundary conditions. 


One Dimensional Boundary Conditions 


Example: Returning to a 1D example in r: Find the Green function for the equation:

$$ \frac{d^2}{dr^2}f(r) = s(r), \ \text{ on the interval } [0,1], \ \text{ subject to BC: } \ f(0) = f(1) = 0 . $$

Solution: The Green function equation replaces the source s(r) with δ(r − r'):

$$ \frac{d^2}{dr^2}G_{r'}(r) = \delta(r-r') . $$

Note that $G_{r'}(r)$ satisfies the homogeneous equation on either side of r':

$$ \frac{d^2}{dr^2}G_{r'}(r \ne r') = 0 . $$


The full Green function simply matches two homogeneous solutions, one to the left of r', and another to the
right of r', such that the discontinuity at r' creates the required δ-function there. First we find the
homogeneous solutions h(r) (not an impulse response):

$$ \begin{aligned}
\frac{d^2}{dr^2}h(r) &= 0 && \text{Integrate both sides:} \\
\frac{d}{dr}h(r) &= C && \text{where } C \text{ is an integration constant. Integrate again:} \\
h(r) &= Cr + D && \text{where } C, D \text{ are arbitrary constants.} \qquad (4.5)
\end{aligned} $$


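There are now 2 cases: (left) r < r', and (right) r > r'. Each solution requires its own set of integration
constants.

$$ \begin{aligned}
\text{Left case:}\quad & r < r' \ \Rightarrow \ G_{r'}(r) = Cr + D . \\
& \text{Only the left boundary condition applies to } r < r'\text{:} \quad G_{r'}(0) = 0 \ \Rightarrow \ D = 0 . \\
\text{Right case:}\quad & r > r' \ \Rightarrow \ G_{r'}(r) = Er + F . \\
& \text{Only the right boundary condition applies to } r > r'\text{:} \quad G_{r'}(1) = 0 \ \Rightarrow \ E + F = 0, \ \ F = -E .
\end{aligned} $$

So far, we have:

$$ \text{Left case: } \ G_{r'}(r<r') = Cr \qquad\qquad \text{Right case: } \ G_{r'}(r>r') = Er - E . $$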


The integration constants C and E are as-yet unknown. Now we must match the two solutions at r = r',
and introduce a delta function there. The δ-function must come from the highest derivative in $\mathcal{L}\{\ \}$, in this
case the 2nd derivative, because if dG/dr had a delta function, then the 2nd derivative d²G/dr² would have the
derivative of a δ-function, which cannot be canceled by any other term in $\mathcal{L}\{\ \}$. Since the derivative of a step
(discontinuity) is a δ-function, dG/dr must have a step, so that d²G/dr² has a δ-function. And finally, if dG/dr
has a step, then G(r) has a cusp (aka “kink” or sharp point).


We can find G(r) to satisfy all this by matching G(r) and dG/dr of the left and right Green functions, at
the point where they meet, r = r':

$$ \text{Left: } \ \frac{d}{dr}G_{r'}(r<r') = C \qquad\qquad \text{Right: } \ \frac{d}{dr}G_{r'}(r>r') = E . $$

There must be a unit step in the derivative across r = r':

$$ \left.\frac{\partial G}{\partial r}\right|_{r'_-} + 1 = \left.\frac{\partial G}{\partial r}\right|_{r'_+} \quad\Rightarrow\quad C + 1 = E . \qquad (4.6) $$

So we eliminate E in favor of C. Also, G(r) must be continuous (or else dG/dr would have a δ-function),
which means:

$$ G_{r'}(r = r'_-) = G_{r'}(r = r'_+) \quad\Rightarrow\quad Cr' = (C+1)r' - C - 1, \qquad C = r' - 1, $$

yielding the final Green function for the given differential equation and boundary conditions:

$$ G_{r'}(r<r') = (r'-1)\,r, \qquad G_{r'}(r>r') = r'r - r' = r'(r-1) . $$

Here’s a plot of these Green functions for different values of r':

[Figure: plots of $G_{r'}(r)$ for three different values of r'.]


Normalization is important, because the δ-function in $\mathcal{L}\{G(r)\} = \delta(r)$ must have unit magnitude.


To find the solution f(r), we need to integrate over r'; therefore, it is convenient to write the Green
function as a true function of two variables:

$$ G(r\,;r') \equiv G_{r'}(r) \quad\Rightarrow\quad \mathcal{L}\{G(r\,;r')\} = \delta(r-r'), \ \text{ and } \ G(\text{boundary}\,;r') = 0, $$

where the “;” between r and r' emphasizes that G(r ; r') is a function of r, parameterized by r'. I.e., we can
still think of G(r ; r') as a family of functions of r, where each family member is labeled by r', and each family
member satisfies the homogeneous boundary condition.


Finally, the particular solution to the original equation, which now satisfies the homogeneous boundary
conditions, is:

$$ f(r) = \int_0^1 dr'\ s(r')\,G(r\,;r') = \int_0^r dr'\ s(r')\underbrace{r'(r-1)}_{G(r;r'),\ r>r'} + \int_r^1 dr'\ s(r')\underbrace{(r'-1)\,r}_{G(r;r'),\ r<r'}, $$

which satisfies f(boundary) = 0.


Summary: To solve $\mathcal{L}\{G_{r'}(r)\} = \delta(r-r')$ in one dimension:

• We break $G_{r'}(r)$ into left- and right- sides of r'. Each side satisfies the homogeneous equation,
  $\mathcal{L}\{G_{r'}(r)\} = 0$, with arbitrary integration constants.

• We establish a first matching condition on $G_{r'}(r)$, which is usually that it must be continuous at r'.

• We establish another matching condition to achieve the δ-function at r'. This establishes a set
  of simultaneous equations for the integration constants in the homogeneous solutions.

• We solve for the constants, yielding the left-of-r' and right-of-r' pieces of the complete Green
  function, G(r; r').
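
The result of these steps for the example operator d²/dr² on [0, 1] can be sketched and checked numerically. The source s(r) = 1, the sample points, and the function names are our assumptions; for that source, solving f'' = 1 with f(0) = f(1) = 0 by hand gives f(r) = r(r − 1)/2.

```python
def G(r, rp):
    """G(r; r') for d^2/dr^2 on [0, 1], with G(0; r') = G(1; r') = 0."""
    if r < rp:
        return (rp - 1.0) * r      # left piece:  C r,     with C = r' - 1
    return rp * (r - 1.0)          # right piece: E r - E, with E = r'

def particular_solution(s, r, n=2000):
    """Midpoint-rule estimate of f(r) = integral over [0, 1] of s(r') G(r; r') dr'."""
    dr = 1.0 / n
    return sum(s((k + 0.5) * dr) * G(r, (k + 0.5) * dr) * dr for k in range(n))

def s(r):
    return 1.0   # assumed source, chosen so the exact answer is easy to write down

# Matching condition: the slope of G jumps by exactly 1 across r = r'
rp, h = 0.4, 1e-6
slope_left = (G(rp, rp) - G(rp - h, rp)) / h
slope_right = (G(rp + h, rp) - G(rp, rp)) / h
step = slope_right - slope_left

f_num = particular_solution(s, 0.3)
f_exact = 0.3 * (0.3 - 1.0) / 2.0
print(step, f_num, f_exact)
```
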


Aside: It is amusing to notice that we use solutions to the homogeneous equation to construct the Green 
function. We then use the Green function to construct the particular solution to the given (inhomogeneous) 
equation. So we are ultimately constructing a particular solution from a homogeneous solution. That’s not like 
anything we learned in undergraduate differential equations. 


When Can You Collapse a Green Function to One Variable? 


“Portable” Green Functions: When we first introduced the Green function, we ignored boundary 
conditions, and our Green function was a function of one variable, r. If our source wasn’t at the origin, we 
just shifted our Green function, and it was a function of just (r—r’). Then we saw that with (certain) boundary 
conditions, shifting doesn’t work, and the Green function is a function of two variables, r and r'. In general,
then, under what conditions can we write a Green function in the simpler form, as a function of just (r— r’)? 


We can say it’s “portable.” 


This is fairly common: differential operators are translation-invariant (i.e., they do not explicitly depend 
on position), and BCs at infinity are translation-invariant. For example, in E&M it is common to have 
equations such as: 


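$$ -\nabla^2\phi(\mathbf{r}) = \rho(\mathbf{r}), \ \text{ with boundary condition } \ \phi(\infty) = 0 . $$


Because both the operator −∇² and the boundary conditions are translation invariant, we don’t need to
introduce r' explicitly as a parameter in G(r). As we did in (4.4) when introducing Green functions, we can
take the origin as the location of the delta function to find G(r), and use translation invariance to “move
around” the delta function:

$$ G(\mathbf{r}\,;\mathbf{r}') \equiv G_{\mathbf{r}'}(\mathbf{r}) = G(\mathbf{r}-\mathbf{r}') \quad\text{and}\quad \mathcal{L}\{G(\mathbf{r}-\mathbf{r}')\} = \delta(\mathbf{r}-\mathbf{r}'), \quad\text{with BC: } \ G(\infty) = 0 . $$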


Non-homogeneous Boundary Conditions 


So far, we’ve dealt with homogeneous boundary conditions by requiring $G_{r'}(r) \equiv G(r\,;r')$ to be zero on
the boundary (which may be at infinity). But there are different kinds of boundary conditions, and different 
ways of dealing with each kind. 


[Note that in general, constraint conditions don’t have to be specified at the boundary of anything. They are 
really just “constraints” or “conditions.” For example, one constraint is often that the solution be a “normalized” 
function, which is not a statement about any boundaries. But in most physical problems, at least one condition does 
occur at a boundary, so we defer to common usage, and limit ourselves here to boundary conditions. ] 


Boundary Conditions Specifying Only Values of the Solution 


In one common case, we are given a general (inhomogeneous) boundary condition, m(r) along the 
boundary of the region of interest. Our problem is now to find the complete solution c(r) such that 

$$ \mathcal{L}\{c(r)\} = s(r), \quad\text{and}\quad c(\text{boundary}) = m(\text{boundary}) . $$


One approach to find c(r) is from elementary differential equations: we find a particular solution f(r) to the
given equation, that doesn’t necessarily meet the boundary conditions. Then we add a linear combination of 
homogeneous solutions to achieve the boundary conditions, while preserving the solution of the non- 
homogeneous equation. There are 3 steps: 


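(1) First solve for f(r), as above, such that:

$$ \mathcal{L}\{f(r)\} = s(r), \quad\text{and}\quad f(\text{boundary}) = 0, $$

using a Green function satisfying:

$$ \mathcal{L}\{G(r\,;r')\} = \delta(r-r') \quad\text{and}\quad G(\text{boundary}\,;r') = 0 . $$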


(2) Find homogeneous solutions h;(r) which are non-zero on the boundary, using ordinary methods (see 
any differential equations text): 


$$ \mathcal{L}\{h_i(r)\} = 0, \quad\text{and}\quad h_i(\text{boundary}) \ne 0 . $$


Recall that in finding the Green function, we already had to find homogeneous solutions, since every Green 
function is a homogeneous solution everywhere except at the d-function position, r’. 


(3) Finally, we add a linear combination of homogeneous solutions to the particular solution to yield a 
complete solution which satisfies both the differential equation and the boundary conditions. Thus we find 
coefficients A; such that: 


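$$ A_1h_1(r) + A_2h_2(r) + \dots = m(r), \quad\text{and}\quad \mathcal{L}\{A_1h_1(r) + A_2h_2(r) + \dots\} = 0 \ \text{ by superposition.} $$

Then our solution is c(r):

$$ \begin{aligned}
c(r) &= f(r) + A_1h_1(r) + A_2h_2(r) + \dots \quad\text{because,} \\
\mathcal{L}\{c(r)\} &= \mathcal{L}\{f(r) + A_1h_1(r) + A_2h_2(r) + \dots\} \\
&= \mathcal{L}\{f(r)\} = s(r), \quad\text{and}\quad c(\text{boundary}) = m(\text{boundary}) .
\end{aligned} $$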


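Continuing Example: In our 1D example above, we have:

$$ \mathcal{L}\{\ \} = \frac{d^2}{dr^2} \quad\text{and}\quad G_{r'}(r<r') = (r'-1)\,r, \quad G_{r'}(r>r') = r'(r-1), $$

$$ \text{satisfying BC: } \ G_{r'}(0) = G_{r'}(1) = 0 \quad\Rightarrow\quad f(0) = f(1) = 0, \ \ \forall s(r) . $$

We now impose inhomogeneous boundary conditions on the original problem: c(0) = 2, and c(1) = 3.
Our linearly independent homogeneous solutions are, from (4.5):

$$ h_1(r) = A_1 r \qquad h_0(r) = A_0 \ \text{ (a constant)} . $$

To satisfy the BC, we need

$$ \begin{aligned}
h_1(0) + h_0(0) = 2 \quad &\Rightarrow\quad A_0 = 2 \\
h_1(1) + h_0(1) = 3 \quad &\Rightarrow\quad A_1 = 1 .
\end{aligned} $$

Thus our complete solution, satisfying the given BCs, is:

$$ c(r) = \left(\int_0^1 dr'\ s(r')\,G(r\,;r')\right) + r + 2 . \qquad (4.7) $$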
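
The complete solution for c(0) = 2, c(1) = 3 can be sanity-checked with a specific source. The source s(r) = 6r, the sample point, and the function names below are our assumptions, picked so the answer is easy to verify by hand: the particular solution is then f(r) = r³ − r, so c(r) = r³ + 2.

```python
def G(r, rp):
    """Green function for d^2/dr^2 on [0, 1] with G(0; r') = G(1; r') = 0."""
    return (rp - 1.0) * r if r < rp else rp * (r - 1.0)

def s(r):
    return 6.0 * r   # assumed source

def c(r, n=2000):
    """Complete solution: Green-function integral plus the homogeneous part r + 2."""
    dr = 1.0 / n
    f = sum(s((k + 0.5) * dr) * G(r, (k + 0.5) * dr) * dr for k in range(n))
    return f + r + 2.0

c0, c_half, c1 = c(0.0), c(0.5), c(1.0)
print(c0, c_half, c1)   # boundary values 2 and 3; interior value close to 0.5**3 + 2
```

The homogeneous part r + 2 is the same for every source; only the integral changes when s(r) changes.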


Boundary Conditions Specifying a Value and a Derivative


Another common kind of boundary conditions specifies a value and a derivative for our complete 
solution. For example, in 1D: 


$$ c(0) = 1 \quad\text{and}\quad c'(0) = 5 . $$


Recall that our previous Green function (4.7) is not required to have any particular derivative at zero. When
we find a particular solution, f(r), we have no idea what its derivative at zero, f'(0), will be. And in particular,
different source functions, s(r), will produce different f(r), with different values of f'(0). This is bad for our
new BCs. In the previous case of BC, f(r) was zero at the boundaries for any s(r). What we need with our
new BC is f(0) = 0 and f'(0) = 0 for any s(r). We can easily achieve this by using a different Green function!
We subjected our first Green function to the boundary conditions G(0; r') = 0 and G(1; r') = 0 specifically
to give the same BC to f(r), so we could add our homogeneous solutions independently of s(r). Therefore, in
our example $\mathcal{L}\{\ \} = d^2/dr^2$, we now choose our Green function BC to be:

$$ G(0\,;r') = 0 \quad\text{and}\quad \frac{d}{dr}G(0\,;r') = 0, \quad\text{with}\quad \mathcal{L}\{G(r\,;r')\} = \frac{d^2}{dr^2}G(r\,;r') = \delta(r-r') . $$

We can see by inspection that this leads to a new Green function (Figure 4.3):

$$ G(r\,;r') = 0, \ \ r < r', \quad\text{and}\quad G(r\,;r') = r - r', \ \ r > r' . $$

[Figure: plots of G(r ; r') for three values of r'.]

Figure 4.3 Green functions for 3 different values of r'.


The 2nd derivative of G(r; r') is zero everywhere except at r', where the first derivative steps from 0 to 1. Therefore,
our new particular solution f(r) also satisfies:

$$f(r) = \int_0^1 dr'\, s(r')\, G(r; r') \qquad\text{and}\qquad f(0) = 0, \quad f'(0) = 0, \quad \forall\, s(r).$$

We complete the solution using our homogeneous solutions to meet the BC:

$$h_1(r) = A_1 r, \qquad h_0(r) = A_0 \quad \text{(a constant)},$$

$$h_1(0) + h_0(0) = 1 \;\Rightarrow\; A_0 = 1, \qquad h_1'(0) + h_0'(0) = 5 \;\Rightarrow\; A_1 = 5.$$

Then:

$$c(r) = \int_0^1 dr'\, s(r')\, G(r; r') + 5r + 1.$$
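
The value-and-derivative construction can be checked numerically. The following is a minimal Python sketch, not taken from the original text: the source s(r) = cos r, the midpoint-rule quadrature, and all function names are assumptions chosen for illustration. For this source, solving c'' = cos r with c(0) = 1 and c'(0) = 5 by hand gives c(r) = 2 + 5r − cos r, which the Green-function integral should reproduce.

```python
import math

def green(r, rp):
    """Green function of d^2/dr^2 with G(0; r') = 0 and dG/dr(0; r') = 0."""
    return r - rp if r > rp else 0.0

def particular(r, s, n=2000):
    """f(r) = integral over [0, 1] of s(r') G(r; r') dr', by the midpoint rule."""
    h = 1.0 / n
    return sum(s((i + 0.5) * h) * green(r, (i + 0.5) * h) for i in range(n)) * h

def complete(r, s):
    """c(r) = f(r) + 5 r + 1, which meets c(0) = 1 and c'(0) = 5."""
    return particular(r, s) + 5.0 * r + 1.0

source = math.cos  # hypothetical source function s(r)

def exact(r):
    """Hand-derived solution of c'' = cos r, c(0) = 1, c'(0) = 5."""
    return 2.0 + 5.0 * r - math.cos(r)

if __name__ == "__main__":
    for r in (0.0, 0.25, 0.7, 1.0):
        print(r, complete(r, source), exact(r))
```

The same Green function serves any source s(r) and any boundary values; only the added homogeneous terms (here 5r + 1) change with the values.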


In general, the Green function depends not only on the particular operator, 
but also on the kind of boundary conditions specified. 


The Green function does not depend on the values of the given BCs. 
Boundary Conditions Specifying Ratios of Derivatives and Values


Another kind of boundary conditions specifies a ratio of the solution to its derivative, or equivalently, 
specifies a linear combination of the solution and its derivative be zero. This is equivalent to a homogeneous 
boundary condition: 

$$\frac{c'(0)}{c(0)} = \alpha, \qquad\text{or equivalently (if } c(0) \ne 0\text{)}\qquad c'(0) - \alpha\, c(0) = 0.$$


This BC arises, for example, in some quantum mechanics problems where the normalization of the wavefunction
is not yet known; the ratio cancels any normalization factor, so the solution can proceed without
knowing the ultimate normalization. Note that this is only a single BC. If our differential operator is 2nd
order, there is one more degree of freedom that can be used to achieve some other condition, such as
normalization. (This BC is sometimes given as βc'(0) − αc(0) = 0, but this simply multiplies both sides by a
constant, and fundamentally changes nothing.)


Importantly, this condition is homogeneous: a linear combination of functions which satisfy the BC also 
satisfies the BC. This is most easily seen from the form given above, right: 


If d'(0) − αd(0) = 0, and e'(0) − αe(0) = 0,
then c(r) = A d(r) + B e(r) satisfies c'(0) − αc(0) = 0,
because c'(0) − αc(0) = A(d'(0) − αd(0)) + B(e'(0) − αe(0)) = 0.


Therefore, if we choose a Green function which satisfies the given homogeneous BC, our particular solution
f will also satisfy the BC. There is no need to add any homogeneous solutions.


Continuing Example: In our 1D example above, with $L = d^2/dr^2$, we now specify the BC:

$$\frac{c'(0)}{c(0)} = 2, \qquad\text{or}\qquad c'(0) - 2c(0) = 0.$$

Green functions for this operator are always two connected line segments (because their 2nd derivatives are
zero), so we have:

$$r < r': \quad G(r; r') = Cr + D, \quad D \ne 0 \ \text{so that}\ c(0) \ne 0;$$

$$r > r': \quad G(r; r') = Er + F;$$

$$\text{BC at 0:} \quad C - 2D = 0.$$

With this BC, we have an unused degree of freedom, so we choose D = 1, implying C = 2. We must
find E and F so that G(r; r') is continuous, and G'(r; r') has a unit step at r'. The latter condition requires
that E = C + 1 = 3, and then continuity requires:

$$Cr' + D = Er' + F \;\Rightarrow\; 2r' + 1 = 3r' + F, \qquad F = -r' + 1.$$

So:

$$r < r': \quad G(r; r') = 2r + 1 \qquad\text{and}\qquad r > r': \quad G(r; r') = 3r - r' + 1.$$


Figure 4.4 1D Green functions G(r; r'); the slope changes of +1 occur at r' (dotted red lines), but are subtle
on this scale.
Then our complete solution is just:

$$c(r) = f(r) = \int_0^1 dr'\, s(r')\, G(r; r').$$
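
As a numerical sanity check of the ratio BC, here is a short Python sketch. It is illustrative only: the source s(r') = 6r', the midpoint-rule quadrature, and the function names are assumptions, not part of the original text. It uses the Green function G = 2r + 1 for r < r' and G = 3r − r' + 1 for r > r'. For this source, the integral can be done by hand, giving f(r) = r³ + 6r + 3, so f'' = 6r and f'(0) − 2f(0) = 6 − 6 = 0.

```python
def green_ratio(r, rp):
    """G(r; r') for d^2/dr^2 with the homogeneous BC G'(0) - 2 G(0) = 0."""
    if r < rp:
        return 2.0 * r + 1.0
    return 3.0 * r - rp + 1.0

def source(rp):
    """Hypothetical source s(r') = 6 r', chosen so the integral has a closed form."""
    return 6.0 * rp

def solution(r, s, n=2000):
    """c(r) = f(r) = integral over [0, 1] of s(r') G(r; r') dr' (midpoint rule)."""
    h = 1.0 / n
    return sum(s((i + 0.5) * h) * green_ratio(r, (i + 0.5) * h) for i in range(n)) * h

def exact(r):
    """Hand-evaluated integral for s(r') = 6 r'."""
    return r**3 + 6.0 * r + 3.0

if __name__ == "__main__":
    d = 1e-3
    slope_at_0 = (solution(d, source) - solution(0.0, source)) / d
    print("f(0.4):", solution(0.4, source), "exact:", exact(0.4))
    print("f'(0) - 2 f(0):", slope_at_0 - 2.0 * solution(0.0, source))
```

No homogeneous solution is added here: because the BC is homogeneous, the particular solution alone satisfies it for any source.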


Boundary Conditions Specifying Only Derivatives (Neumann BC) 


Another common kind of BC specifies derivatives of the solution at given points. For example, we might
have:

$$c'(0) = 0 \quad\text{and}\quad c'(1) = 1.$$

Then, analogous to the BC specifying two values for c( ), we find a Green function which has zeros for its
derivatives at 0 and 1:

$$\frac{d}{dr}G(0; r') = 0 \quad\text{and}\quad \frac{d}{dr}G(1; r') = 0.$$

Then the sum (or integral) of any number of such Green functions also satisfies the zero BCs:

$$f(r) = \int_0^1 dr'\, s(r')\, G(r; r') \quad\text{satisfies}\quad f'(0) = 0 \ \text{and}\ f'(1) = 0.$$

We can now form the complete solution, by adding homogeneous solutions that satisfy the given BC:

$$c(r) = f(r) + A_1 h_1(r) + A_2 h_2(r), \quad\text{where}\quad A_1 h_1'(0) + A_2 h_2'(0) = 0 \quad\text{and}\quad A_1 h_1'(1) + A_2 h_2'(1) = 1.$$


Example: We cannot use our previous example where $L\{\ \} = d^2/dr^2$, because there is no solution to:

$$\frac{d^2}{dr^2}G(r; r') = \delta(r - r') \quad\text{with}\quad \frac{d}{dr}G(r = 0; r') = \frac{d}{dr}G(r = 1; r') = 0.$$

This is because the homogeneous solutions are straight line segments; therefore, any segment with a zero
derivative at its end must be a flat line, and two flat segments cannot supply the unit step in slope at r'.
So we must choose another operator as our example: TBS.


2D?? and 3D Green Functions 


Green Functions Don’t Separate 


In previous sections, we described 1D Green functions, which satisfy:

$$L\{G(x; x')\} = \delta(x - x').$$

(We must change notation slightly from earlier, since in higher dimensions, "r" now has the conventional
meaning: distance from the origin.) A 3D Green function satisfies:

$$L\{G(\mathbf{r}; \mathbf{r}')\} = \delta^3(\mathbf{r} - \mathbf{r}') \qquad\text{(coordinate free)}.$$

Note that δ³ is a (coordinate-free) spherically symmetric function, with no preferred direction. We can choose
to write it as a product of three coordinate functions. For example:

$$L\{G(x, y, z; x', y', z')\} = \delta(x - x')\,\delta(y - y')\,\delta(z - z') \qquad\text{(rectangular coordinates)}.$$


To generalize Green functions to 3D in rectangular coordinates, you might guess that we could multiply
three separate 1D Green functions together. For example, if L separates into x, y, and z parts, does the
following hold?

Let $L_x\{X(x; x')\} = \delta(x - x')$, and similarly for $L_y\{Y(y; y')\}$ and $L_z\{Z(z; z')\}$.

$$\text{Does}\quad G(x, y, z; x', y', z') \overset{?}{=} X(x; x')\,Y(y; y')\,Z(z; z')\,? \quad\text{I.e.,}$$

$$(L_x + L_y + L_z)\{X(x; x')\,Y(y; y')\,Z(z; z')\} \overset{?}{=} \delta^3(\mathbf{r} - \mathbf{r}')\,?$$

We now show that this does not work. As a concrete counter-example, consider the Laplacian operator, ∇². In
1D, it is simply ∂²/∂x². Applying our guess to 3D, we would have:

$$\frac{\partial^2}{\partial x^2}X(x; x') = \delta(x - x'), \quad\text{and similarly for } Y(y; y') \text{ and } Z(z; z').$$

$$\nabla^2(XYZ) = \left(\frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} + \frac{\partial^2}{\partial z^2}\right)(XYZ) = \delta(x - x')\,YZ + \delta(y - y')\,XZ + \delta(z - z')\,XY$$

$$\ne \delta(x - x')\,\delta(y - y')\,\delta(z - z').$$


Green functions do not separate the way solutions to Laplace’s equation do. 


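Let us explore some properties of an actual 3D Green function. A well-known 3D Green function for
the Laplacian, with BC of zero at infinity, is:

$$G(\mathbf{r}; \mathbf{r}') = -\frac{1}{4\pi\,|\mathbf{r} - \mathbf{r}'|}.$$

For simplicity, we fix $\mathbf{r}' = \mathbf{0}_v$ and drop the prefactor. For insight, we write it in rectangular coordinates:

$$G(\mathbf{r}; \mathbf{0}_v) \to \frac{1}{\sqrt{x^2 + y^2 + z^2}}.$$

This is spherically symmetric, as required by the spherical symmetry of ∇² and the BCs, but has no other
obvious structure. It does not seem to factor into X(x)Y(y)Z(z). Nonetheless, we have:

$$\left(\frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} + \frac{\partial^2}{\partial z^2}\right)\frac{1}{\sqrt{x^2 + y^2 + z^2}} = -4\pi\,\delta^3(\mathbf{r}).$$

By symmetry, the three directions each contribute the same amount to the delta function, which is 1/3 of the total, so
the singular (delta-function) part of each term is:

$$\frac{\partial^2}{\partial x^2}\frac{1}{\sqrt{x^2 + y^2 + z^2}} \;\to\; -\frac{4\pi}{3}\,\delta^3(\mathbf{r}).$$

This means the 2nd derivative in a single direction is immediately a 3rd-order delta function; this δ³( ) does
not result from the product of one δ( ) in each direction.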


3D Green functions are hard to understand. We give some examples in the following sections. 


Green Units 


Coordinates have units, operators have units, Green functions have units, and delta functions have units. 
As always, we can use dimensional analysis to sanity-check results, which we do later. As a 1D example: 
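$$[x] = L \ \text{(length)}, \qquad \left[\frac{\partial^2}{\partial x^2}\right] = L^{-2}, \qquad [\delta(x)] = L^{-1}:$$

$$\frac{\partial^2}{\partial x^2}G = \delta \;\Rightarrow\; L^{-2}[G] = L^{-1}, \quad\text{and}\quad [G] = L.$$

If x is in meters, then so is G.


A 3D units example:

$$[x] = L \ \text{(length)}, \qquad [\nabla^2] = L^{-2}, \qquad [\delta^3(\ )] = L^{-3}:$$

$$\nabla^2 G = \delta^3 \;\Rightarrow\; L^{-2}[G] = L^{-3}, \quad\text{and}\quad [G] = L^{-1}.$$

If the coordinates are in meters, then G is in inverse meters.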


Special Case: Laplacian Operator with 3D Boundary Conditions 


In electrostatics, one often uses Green functions with the Laplacian operator, L = ∇², and boundary
conditions, to find the electrostatic potential Φ(r). The Laplacian operator allows a "trick" (see glossary) for
common boundary conditions, that gives a solution in terms of integrals. This section assumes you are
thoroughly familiar with solving Laplace's equation by separation of variables into eigenfunctions (see Funky
Electromagnetic Concepts). Beware that some references define Green functions only for this electrostatic
special case, and so present an overly narrow view of them.


Figure 4.5 (a) A 3D distribution of charges, admired from within. (b) A 1D potential; the flux is
proportional to ∂Φ/∂x'. (c) For Dirichlet BCs, form of G along the normal coordinate n for r' on the
boundary surface S.


Consider a distribution of source charges, as in Figure 4.5a. We continue with the definition of G from
earlier sections, and gaussian units:

$$\nabla^2 G(\mathbf{r}; \mathbf{r}') = \delta^3(\mathbf{r} - \mathbf{r}'), \qquad\text{and}\qquad \nabla^2\Phi(\mathbf{r}) = \underbrace{-4\pi\rho(\mathbf{r})}_{s(\mathbf{r})}.$$

Some references include a factor of (−1) or (−4π) on the δ-function in the definition of G. That breaks the
generality of the Green method, and simply moves the factor from s(r) into the Green function itself, but the
resulting integral (4.8) is identical, as it must be: Φ(r) is uniquely determined by ρ(r') and the BCs. Our
convention for G is used in many references, and we believe is objectively simpler in both theory and practice.


The Laplacian boundary condition trick starts with Green's theorem, which relates a certain kind of
volume integral to a surface integral. We give some insight to Green's theorem in the next section, but the
result is: for any functions defined inside a volume, Φ(r') and ψ(r'):

$$\iiint_{Vol}\left[\Phi(\mathbf{r}')\,\nabla'^2\psi(\mathbf{r}') - \psi(\mathbf{r}')\,\nabla'^2\Phi(\mathbf{r}')\right]d^3r' = \oint_{\partial Vol}\left[\Phi\,\frac{\partial\psi}{\partial n'} - \psi\,\frac{\partial\Phi}{\partial n'}\right]d^2S',$$

$$\text{where}\quad n' \equiv \text{normal coordinate, so, e.g.,}\quad \frac{\partial\psi}{\partial n'} = \nabla'\psi\cdot\hat{\mathbf{n}}'.$$

Note that the primes denote source coordinates. In electrostatics, we let Φ(r') be the electrostatic potential
inside the volume, and ψ(r') → G(r; r'), taking r as fixed. The operator ∇' tells us how a function changes
as we move around the source coordinate r', with r held fixed. Then Φ is explicitly given by (gaussian units):

$$\underbrace{\Phi(\mathbf{r})}_{\mathbf{r}\text{ inside volume}} = \iiint_{Vol} G(\mathbf{r}; \mathbf{r}')\underbrace{\left(-4\pi\rho(\mathbf{r}')\right)}_{s(\mathbf{r}')}\,d^3r' + \oint_{\partial Vol}\left[\Phi(\mathbf{r}')\,\frac{\partial G}{\partial n'} - G(\mathbf{r}; \mathbf{r}')\,\frac{\partial\Phi}{\partial n'}\right]d^2S' \qquad (4.8)$$

where ∂Vol ≡ boundary of the volume.


If r is outside the volume, it violates the terms of Green's Theorem, the volume integral is zero, and the result
is meaningless. At this point, we have not given any BCs for G, so as with all Green functions, there are
many G that satisfy the defining equation ∇²G = δ³(r − r'). We must find BCs for G to make it unique.


Dirichlet BCs: There are 2 terms in the surface integral of (4.8). For Dirichlet BCs, Φ(boundary) is
given. Therefore, we make G unique by choosing G(boundary; r') = 0, so the second surface term vanishes.
Figure 4.5c illustrates G(n, r') along n, the normal coordinate to the boundary surface. This BC for G
guarantees that Φ(r) from (4.8) meets the given Φ(boundary).


Neumann BCs: ∂Φ/∂n' is given everywhere on the boundary. This is equivalent to specifying E_n
or the surface charge density σ everywhere on the boundary, because:

$$\frac{\partial\Phi}{\partial n'} = -E_n = -4\pi\sigma \qquad\text{(gaussian units)}.$$

You might think we choose ∂G/∂n' = 0 everywhere on the boundary, so the first term in the surface integral
would vanish. This turns out to be a contradiction, so it fails to give a solution ([Jac 1999 p39] or [Bra p174],
but note they use different δ-function conventions from each other, and from us). The contradiction appears
from Gauss' Law applied to the definition of the Green function, for r inside the volume:

$$\nabla'^2 G(\mathbf{r}; \mathbf{r}') = \nabla'\cdot\nabla' G = \delta^3(\mathbf{r} - \mathbf{r}') \;\Rightarrow\; \oint_{\partial Vol}\nabla' G\cdot\hat{\mathbf{n}}'\,d^2S' = \iiint_{Vol}\delta^3(\mathbf{r} - \mathbf{r}')\,d^3r' = 1, \quad\text{or}\quad \oint_{\partial Vol}\frac{\partial G}{\partial n'}\,d^2S' = 1.$$

So ∂G/∂n' cannot be 0 everywhere. The simplest requirement to state (not necessarily to solve) is ∂G/∂n' =
constant = 1/S, where S ≡ surface area, which satisfies the above surface integral. However, if the system is
infinitely large, as is commonly approximated, this reduces to the simple ∂G/∂n' = 0.


The final solution then comes from the fact that Φ(r') is defined by Neumann BCs only up to an additive
constant. Therefore, there exists some Φ(r') such that the first term in the surface integral of (4.8) is zero.
Then that Φ satisfies:

$$\underbrace{\Phi(\mathbf{r})}_{\mathbf{r}\text{ inside volume}} = \iiint_{Vol} G(\mathbf{r}; \mathbf{r}')\underbrace{\left(-4\pi\rho(\mathbf{r}')\right)}_{s(\mathbf{r}')}\,d^3r' - \oint_{\partial Vol} G(\mathbf{r}; \mathbf{r}')\,\frac{\partial\Phi}{\partial n'}\,d^2S'.$$

This gives the solution Φ(r) inside the volume as an integral of the given Neumann BCs.
The BCs we choose for the Green function depend only on the type of BCs for Φ
(Dirichlet or Neumann), but not on the boundary values themselves.
Derivation of Green's Theorem


This section is optional. Green’s theorem relates a certain kind of volume integral to a surface integral. 
We start with a one-dimensional section of 3D space, which may be easier to think about (Figure 4.5). 
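Consider any two functions Φ(x') and ψ(x'); we use primes to indicate coordinates of source charges. From
simple integration by parts, we have:

$$\int_a^b \Phi\,\frac{d^2\psi}{dx'^2}\,dx' = \left[\Phi\,\frac{d\psi}{dx'}\right]_a^b - \int_a^b \frac{d\Phi}{dx'}\frac{d\psi}{dx'}\,dx'.$$

We could just as well swap the roles of Φ and ψ, and have:

$$\int_a^b \psi\,\frac{d^2\Phi}{dx'^2}\,dx' = \left[\psi\,\frac{d\Phi}{dx'}\right]_a^b - \int_a^b \frac{d\psi}{dx'}\frac{d\Phi}{dx'}\,dx'.$$

Subtracting the latter from the former cancels the integral on the RHS:

$$\int_a^b\left(\Phi\,\frac{d^2\psi}{dx'^2} - \psi\,\frac{d^2\Phi}{dx'^2}\right)dx' = \left[\Phi\,\frac{d\psi}{dx'} - \psi\,\frac{d\Phi}{dx'}\right]_a^b, \qquad\text{or}$$

$$\int_a^b \Phi\,\frac{d^2\psi}{dx'^2}\,dx' = \int_a^b \psi\underbrace{\frac{d^2\Phi}{dx'^2}}_{\propto\ \text{charge density}}dx' + \left[\Phi\,\frac{d\psi}{dx'} - \psi\,\frac{d\Phi}{dx'}\right]_a^b. \qquad (4.9)$$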

We recognize the charge density in the first integral on the right. We can isolate Φ on the left, at a specific
point x (not x') in the volume, by choosing ψ such that d²ψ/dx'² = δ(x − x'); in other words, by choosing ψ(x')
to be a Green function:

$$\psi(x') \to G(x; x') \quad\text{such that:}\quad \frac{\partial^2 G}{\partial x'^2} = \delta(x - x').$$

For purposes of Green's Theorem, 'x' is a constant; x' is the variable. Green's Theorem holds for any
functions Φ(x') and ψ(x'), so it holds for this choice of ψ. Then the LHS of (4.9) becomes:

$$\int_a^b \Phi\,\frac{d^2\psi}{dx'^2}\,dx' = \int_a^b \Phi(x')\,\delta(x - x')\,dx' = \Phi(x). \qquad (4.10)$$

So (4.9) becomes an explicit integral for Φ(x), x inside the volume:

$$\Phi(x) = \int_a^b G(x; x')\left(-4\pi\rho(x')\right)dx' + \left[\Phi\,\frac{dG}{dx'} - G(x; x')\,\frac{d\Phi}{dx'}\right]_a^b. \qquad (4.11)$$


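To generalize this to 3D, we throw in the requisite vector identities, and upgrade each term in our
development by two additional dimensions. Start by deriving a kind-of 3D version of integration by parts:

$$\iiint_{Vol}\underbrace{\nabla'\cdot\mathbf{A}(\mathbf{r}')}_{\text{scalar}}\,d^3r' = \oint_{\partial Vol}\mathbf{A}(\mathbf{r}')\cdot\hat{\mathbf{n}}'\,d^2S', \qquad\text{where}\quad \hat{\mathbf{n}}' \equiv \text{unit vector pointing outward}.$$

Let $\mathbf{A}(\mathbf{r}') = \Phi(\mathbf{r}')\,\nabla'\psi(\mathbf{r}')$. Use a vector identity for divergence:

$$\nabla'\cdot(\Phi\,\nabla'\psi) = \nabla'\Phi\cdot\nabla'\psi + \Phi\,\nabla'^2\psi \;\Rightarrow\; \iiint_{Vol}\left(\nabla'\Phi\cdot\nabla'\psi + \Phi\,\nabla'^2\psi\right)d^3r' = \oint_{\partial Vol}\Phi\,\nabla'\psi\cdot\hat{\mathbf{n}}'\,d^2S'.$$

As in our 1D warmup, we can swap the roles of Φ and ψ, and subtract the result from the above. The first
term on the LHS cancels, leaving:

$$\iiint_{Vol}\left[\Phi\,\nabla'^2\psi - \psi\,\nabla'^2\Phi\right]d^3r' = \oint_{\partial Vol}\left[\Phi\,\frac{\partial\psi}{\partial n'} - \psi\,\frac{\partial\Phi}{\partial n'}\right]d^2S'.$$

Now choose ψ(r') → G(r; r'), where r is a constant inside the volume, and:

$$\nabla'^2 G(\mathbf{r}; \mathbf{r}') = \delta^3(\mathbf{r} - \mathbf{r}') \;\Rightarrow$$

$$\underbrace{\Phi(\mathbf{r})}_{\mathbf{r}\text{ inside volume}} = \iiint_{Vol} G(\mathbf{r}; \mathbf{r}')\underbrace{\left(-4\pi\rho(\mathbf{r}')\right)}_{s(\mathbf{r}')}\,d^3r' + \oint_{\partial Vol}\left[\Phi(\mathbf{r}')\,\frac{\partial G}{\partial n'} - G(\mathbf{r}; \mathbf{r}')\,\frac{\partial\Phi}{\partial n'}\right]d^2S'.$$

The particular G we use depends on the BCs given in the original problem for Φ(r), as shown in the previous
section.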


Desultory Green Topics 


Fourier Series Method for Green Functions 


In some cases, we cannot find the Green function in closed form, but we can find a Fourier series for it. 
This section assumes you are familiar with Fourier Series, and Green functions without Fourier Series. The 
example below constructs a Green function from a 2D Fourier Series for the x-y parts, and for each Fourier 
component, uses a variant of 1D left-right construction (introduced in an earlier section) for the z part of that 
component. 


To illustrate the Fourier method for Green functions, we expound on the question [Jac Q2.23 and p128-9].
There are many solutions for Q2.23 (which has no source charge) posted on the internet; most use
separation of variables and eigenfunctions. (We describe such a method generally in Funky Electromagnetic
Concepts.) We here derive one form of the Green function for such a problem [Jac 3.168 p129m]. In
principle, this solves for the potential Φ(r) for arbitrary charge density by using (4.8).


Figure 4.6 (a) A cube with specified boundary potentials. (b) Green function for the z-direction,
requiring sinh functions.


The system is a cube of side a, with one corner at the origin, extending to (a, a, a) (Figure 4.6). The
cube has arbitrary charge density ρ(r) inside. The two faces of z = 0 and z = a are at fixed potential Φ = V,
and the other 4 faces are at Φ = 0. Find the potential inside the cube. As with many such problems, it is
slightly ill-posed: the potential along the x and y axes, and 6 other similar edges, is specified as both 0 and
V. We can ignore this by saying that the faces with Φ = V are separated by a tiny distance from the rest of
the cube, so the edges don't quite touch.


The geometry favors rectangular coordinates. The BCs on Φ are Dirichlet (Φ is given everywhere on
the surface of the cube), so the BCs on G are all zero. This means the three coordinate directions are all
equivalent for G, and we could find G as a 3D Fourier series [Jac 3.167 p129]. However, the original problem
is given with z chosen as having different BCs than x and y, so we choose to treat z differently than x and y.
We will Fourier expand the x-y surfaces (2D), but write the z-dependence of each Fourier component (of G)
directly. This is desirable, because lower dimensional series usually converge faster than higher dimensional ones.


2D Fourier Series: Recall that a well-behaved 2D function of a rectangular region of space
x ∈ [0, a], y ∈ [0, b] can be written as a series of sinusoids:

$$f(x, y) = \sum_{l,m=1}^{\infty} A_{lm}\underbrace{\sin\left(\frac{l\pi}{a}x\right)\sin\left(\frac{m\pi}{b}y\right)}_{\text{basis function}} + \text{other cos() terms we won't need here}.$$

We justify the lack of cos( ) shortly. Given the function f(x, y), we can find the coefficients A_lm of its series
expansion from orthogonality of the Fourier basis functions:

$$A_{lm} = \frac{4}{ab}\int_0^b dy\int_0^a dx\, f(x, y)\,\sin\left(\frac{l\pi}{a}x\right)\sin\left(\frac{m\pi}{b}y\right). \qquad (4.12)$$

The leading coefficient above is the inverse of the normalization of the basis function:

$$\int_0^b dy\int_0^a dx\left[\sin\left(\frac{l\pi}{a}x\right)\sin\left(\frac{m\pi}{b}y\right)\right]^2 = \frac{ab}{4}.$$
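
The normalization ab/4 and the coefficient formula (4.12) are easy to confirm numerically. The following Python sketch is illustrative only: the rectangle size (a = 2, b = 3), the test function, the midpoint-rule quadrature, and the function names are all assumptions, not part of the original text. It builds a function from two known sine components and recovers their coefficients.

```python
import math

def integrate2d(g, a, b, n=200):
    """Midpoint-rule integral of g(x, y) over the rectangle [0, a] x [0, b]."""
    hx, hy = a / n, b / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * hx
        for j in range(n):
            total += g(x, (j + 0.5) * hy)
    return total * hx * hy

def basis(l, m, a, b):
    """Fourier basis function sin(l pi x / a) sin(m pi y / b)."""
    return lambda x, y: math.sin(l * math.pi * x / a) * math.sin(m * math.pi * y / b)

def coefficient(f, l, m, a, b):
    """A_lm from (4.12): (4 / ab) times the overlap of f with the basis function."""
    phi = basis(l, m, a, b)
    return 4.0 / (a * b) * integrate2d(lambda x, y: f(x, y) * phi(x, y), a, b)

if __name__ == "__main__":
    a, b = 2.0, 3.0                                   # hypothetical rectangle
    phi12, phi31 = basis(1, 2, a, b), basis(3, 1, a, b)
    f = lambda x, y: 3.0 * phi12(x, y) - 0.5 * phi31(x, y)   # known coefficients
    print("normalization:", integrate2d(lambda x, y: phi12(x, y) ** 2, a, b), "vs", a * b / 4)
    print("A_12:", coefficient(f, 1, 2, a, b))
    print("A_31:", coefficient(f, 3, 1, a, b))
    print("A_22:", coefficient(f, 2, 2, a, b))
```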


3D Green function: For the potential of the cube, Φ(x, y, z), we seek the Green function, which looks
like this in rectangular coordinates:

$$\nabla^2 G(x, y, z; x', y', z') = \delta^3(\mathbf{r} - \mathbf{r}') = \delta(x - x')\,\delta(y - y')\,\delta(z - z'). \qquad (4.13)$$

In our parlance, we say G( ) is the piece of Φ at (x, y, z) due to the piece of source at (x', y', z'). As described
in a previous section, G (as a whole) does not separate into X(x)Y(y)Z(z). However, each Fourier component
of G is a solution to Laplace's equation everywhere except at r', so each component can be separated into
X(x)Y(y)Z(z), while still including a discontinuity. In such a separation for the ∇² operator in rectangular
coordinates, at least one function is sin/cos, and at least one is sinh/cosh. Because we chose to Fourier expand
x-y, they must be sin/cos, and therefore Z(z) must be sinh/cosh. Thus G can be written:

$$G(x, y, z; x', y', z') = \sum_{l,m=1}^{\infty} A_{lm}(x', y')\,\sin\left(\frac{l\pi}{a}x\right)\sin\left(\frac{m\pi}{a}y\right)\underbrace{Z_{lm}(z; z')}_{\text{sinh/cosh}}. \qquad (4.14)$$

Note that each pair of values (x', y') has its own distinct Fourier series. We call the z part of each component
of the Green function Z_lm(z; z'). Note that each lm component has a different Z_lm, which is why there is no
global Z(z) that can be separated from the rest of G. The units of the coordinates are [x] = [y] = [z] = distance,
and [A_lm Z_lm] are [x]⁻¹.


As noted earlier, the BCs for Φ given in the problem define the BCs for G( ), which then makes G( )
unique. We must impose G( ) = 0 everywhere on the boundary (all 6 faces):

$$G(\text{boundary}; x', y', z') = 0, \qquad \forall\, x', y', z'.$$

Each Fourier component satisfies Laplace's equation everywhere except at (x', y', z'), and is zero on the
boundaries. The BC on G demands a square slice of z = constant has G = 0 around its perimeter. This can
be satisfied with X and Y = sin( ), but not cos( ). Thus:

$$X_{lm}(x) = \sin\left(\frac{l\pi}{a}x\right), \qquad Y_{lm}(y) = \sin\left(\frac{m\pi}{a}y\right), \qquad l, m \text{ integer}.$$

The infinite Fourier sums in X and Y compose δ(x − x')δ(y − y'), leaving only δ(z − z') to be constructed in
Z_lm. In rectangular coordinates, X_lm depends only on l, and Y_lm depends only on m. We retain the "lm" on
both because other coordinate systems don't separate so cleanly. (Y_lm here is not a spherical harmonic.)
Now to find G, "all" we must do is find the Z_lm and A_lm. Z_lm must provide the δ(z − z') in (4.13), so we
start there. The Z_lm must look like Figure 4.6b, because they are zero at z = 0 and z = a, and each must have
a positive step in its derivative at z = z'. We already know that Z_lm(z) comprises only sinh/cosh, but because
G(boundary) = 0, it must be made of only sinh. From Figure 4.6b:

$$\text{For } z < z': \quad Z_{lm}(z; z') = A\sinh(k_{lm}z),$$

$$\text{For } z > z': \quad Z_{lm}(z; z') = B\sinh\left(k_{lm}(a - z)\right), \qquad\text{where}\quad k_{lm} = \frac{\pi}{a}\sqrt{l^2 + m^2}.$$

k_lm is chosen to cancel the sum of the eigenvalues from X(x) and Y(y), as described in the section on boundary
value problems. Since k_lm depends on the component "lm", each Z_lm is a different function.


It is customary to combine these two pieces of Z_lm into a single form:

$$Z_{lm}(z; z') = C\sinh(k_{lm}z_<)\,\sinh\left(k_{lm}(a - z_>)\right), \qquad\text{where}\quad z_< \equiv \min(z, z'), \quad z_> \equiv \max(z, z').$$

Remember that for purposes of derivatives, z' is a given constant, so in the above form, one factor is a function
of z, and the other is just a constant that depends on z'. (This combined form looks clumsy, but is helpful
with deeper concepts of self-adjointness which we do not pursue here.) The coefficient C could be absorbed
into the Fourier coefficients A_lm, but we have to do the work sooner or later. Therefore, we opt to keep all
the z-dependence tidily in Z_lm, so we find C now:


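$$\left.\frac{dZ_{lm}}{dz}\right|_{z<z'} = Ck_{lm}\cosh(k_{lm}z)\,\sinh\left(k_{lm}(a - z')\right), \qquad \left.\frac{dZ_{lm}}{dz}\right|_{z>z'} = -Ck_{lm}\sinh(k_{lm}z')\,\cosh\left(k_{lm}(a - z)\right).$$

The unit step in derivative at z' gives:

$$\left.\frac{dZ_{lm}}{dz}\right|_{z'^+} - \left.\frac{dZ_{lm}}{dz}\right|_{z'^-} = 1 = -Ck_{lm}\left[\sinh(k_{lm}z')\cosh\left(k_{lm}(a - z')\right) + \cosh(k_{lm}z')\sinh\left(k_{lm}(a - z')\right)\right].$$

Use: sinh(u + v) = sinh u cosh v + cosh u sinh v:

$$1 = -Ck_{lm}\sinh\left(k_{lm}(z' + a - z')\right) \;\Rightarrow\; C = \frac{-1}{k_{lm}\sinh(k_{lm}a)}.$$

Note that C is negative, as shown in Figure 4.6b, and does not depend on the source point z'. The complete Z_lm is:

$$Z_{lm}(z; z') = \frac{-1}{k_{lm}\sinh(k_{lm}a)}\,\sinh(k_{lm}z_<)\,\sinh\left(k_{lm}(a - z_>)\right). \qquad (4.15)$$

As expected, Z_lm is not a 1D Green function of ∂²/∂z², because its 2nd derivative is non-zero everywhere inside the boundary.
However, it does provide the discontinuity required in G. In fact:

$$\frac{\partial^2}{\partial z^2}Z_{lm}(z \ne z'; z') = k_{lm}^2 Z_{lm}, \qquad \frac{\partial^2}{\partial z^2}Z_{lm}(z = z'; z') = \delta(z - z'), \qquad\text{and}\qquad \int_{z'^-}^{z'^+}\frac{\partial^2}{\partial z^2}Z_{lm}\,dz = 1.$$

(We could say, in general, $\frac{\partial^2}{\partial z^2}Z_{lm}(z; z') = k_{lm}^2 Z_{lm} + \delta(z - z')$.) [k_lm] = [z]⁻¹, so the scaling of Z_lm gives it
units of distance, [z].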
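
The properties claimed for Z_lm can be spot-checked numerically. Below is a minimal Python sketch under stated assumptions: the cube side a = 1, the component (l, m) = (1, 2), the source point z' = 0.3, the finite-difference step sizes, and the function names are all chosen for illustration and do not come from the original text. It evaluates the combined form Z_lm(z; z') = −sinh(k z_<) sinh(k(a − z_>)) / (k sinh(k a)), and checks that it vanishes at z = 0 and z = a, that its slope steps by +1 at z', and that its 2nd derivative is k² Z_lm away from z'.

```python
import math

def k_lm(l, m, a):
    """k_lm = (pi / a) * sqrt(l^2 + m^2)."""
    return math.pi / a * math.sqrt(l * l + m * m)

def Z(z, zp, l, m, a):
    """Combined form of Z_lm(z; z') using z_< = min(z, z') and z_> = max(z, z')."""
    k = k_lm(l, m, a)
    z_lt, z_gt = min(z, zp), max(z, zp)
    return -math.sinh(k * z_lt) * math.sinh(k * (a - z_gt)) / (k * math.sinh(k * a))

def slope_jump(zp, l, m, a, d=1e-6):
    """One-sided finite-difference slopes just above and below z'; returns their difference."""
    above = (Z(zp + d, zp, l, m, a) - Z(zp, zp, l, m, a)) / d
    below = (Z(zp, zp, l, m, a) - Z(zp - d, zp, l, m, a)) / d
    return above - below

def second_derivative(z, zp, l, m, a, d=1e-3):
    """Centered second difference; only meaningful when z is farther than d from z'."""
    return (Z(z + d, zp, l, m, a) - 2.0 * Z(z, zp, l, m, a) + Z(z - d, zp, l, m, a)) / (d * d)

if __name__ == "__main__":
    a, l, m, zp = 1.0, 1, 2, 0.3                      # hypothetical parameters
    print("Z at the walls:", Z(0.0, zp, l, m, a), Z(a, zp, l, m, a))
    print("slope jump at z':", slope_jump(zp, l, m, a))
    z = 0.6
    print("Z'' / (k^2 Z) at z = 0.6:",
          second_derivative(z, zp, l, m, a) / (k_lm(l, m, a) ** 2 * Z(z, zp, l, m, a)))
```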


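For A_lm(x', y') we have, from (4.13) and (4.14):

$$\nabla^2\sum_{l,m=1}^{\infty} A_{lm}(x', y')\,\sin\left(\frac{l\pi}{a}x\right)\sin\left(\frac{m\pi}{a}y\right)Z_{lm}(z; z') = \delta(x - x')\,\delta(y - y')\,\delta(z - z')$$

$$\Rightarrow\quad \sum_{l,m=1}^{\infty} A_{lm}\left[\frac{\partial^2}{\partial z^2} - \left(\frac{l^2\pi^2}{a^2} + \frac{m^2\pi^2}{a^2}\right)\right]\sin\left(\frac{l\pi}{a}x\right)\sin\left(\frac{m\pi}{a}y\right)Z_{lm} = \delta(x - x')\,\delta(y - y')\,\delta(z - z').$$

This means the A_lm have units of [x]⁻². To pick out a single coefficient A_l'm', we multiply both sides by the
Fourier basis function, and integrate over the x-y region, recalling the basis function normalization is a²/4:

$$A_{l'm'}\,\frac{a^2}{4}\left(\frac{\partial^2}{\partial z^2} - k_{l'm'}^2\right)Z_{l'm'} = \int_0^a\!\!\int_0^a dx\,dy\,\sin\left(\frac{l'\pi}{a}x\right)\sin\left(\frac{m'\pi}{a}y\right)\delta(x - x')\,\delta(y - y')\,\delta(z - z')$$

$$= \sin\left(\frac{l'\pi}{a}x'\right)\sin\left(\frac{m'\pi}{a}y'\right)\delta(z - z').$$

The only term that survives on the left is from $\frac{\partial^2}{\partial z^2}Z_{l'm'}(z = z') = \delta(z - z')$, which cancels the δ( ) on the right:

$$A_{l'm'}(x', y')\,\frac{a^2}{4} = \sin\left(\frac{l'\pi}{a}x'\right)\sin\left(\frac{m'\pi}{a}y'\right).$$

(Equivalently, we could integrate both sides with $\int_{z'^-}^{z'^+}(\ )\,dz$.) We drop the primes from l' and m', yielding:

$$A_{lm}(x', y') = \frac{4}{a^2}\sin\left(\frac{l\pi}{a}x'\right)\sin\left(\frac{m\pi}{a}y'\right).$$

The final Green function combines these A_lm with Z_lm from (4.15):

$$G(x, y, z; x', y', z') = -\sum_{l,m=1}^{\infty}\frac{4}{a^2}\sin\left(\frac{l\pi}{a}x'\right)\sin\left(\frac{m\pi}{a}y'\right)\sin\left(\frac{l\pi}{a}x\right)\sin\left(\frac{m\pi}{a}y\right)\frac{\sinh(k_{lm}z_<)\,\sinh\left(k_{lm}(a - z_>)\right)}{k_{lm}\sinh(k_{lm}a)}.$$

Using (4.8), this G gives Φ(r) in integral form for arbitrary ρ(r) and Dirichlet BCs.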


Green-Like Methods: The Born Approximation 


In the Born approximation, and similar problems, we have our unknown function, now called ψ(x), on
both sides of the equation. So both our unknown function f(x) → ψ(x), and our source s(x) → ψ(x):

$$(1) \qquad L\{\psi(x)\} = \psi(x).$$

The theory of Green functions still works, so that:

$$\psi(x) = \int \psi(x')\,G(x; x')\,dx',$$

but this doesn't solve the equation, because we still have ψ on both sides of the equation. We could try
rearranging Eq (1):

$$L\{\psi(x)\} - \psi(x) = 0, \quad\text{which is the same as}\quad L'\{\psi(x)\} = 0, \quad\text{with}\quad L'\{\psi(x)\} \equiv L\{\psi(x)\} - \psi(x).$$

But recall that Green functions require a nonzero source function s(x) on the right-hand side. The method of
Green functions can't solve homogeneous equations, because s(x) = 0 yields:

$$L\{\psi(x)\} = s(x) = 0 \;\Rightarrow\; \psi(x) = \int s(x')\,G(x; x')\,dx' = \int 0\,dx' = 0.$$

Technically, this is a solution, but it's not very useful. So Green functions don't work when ψ(x) appears on
both sides. However, under the right conditions, we can make a useful approximation. If we have an
approximate solution,

$$L\{\psi^{(0)}(x)\} \approx \psi^{(0)}(x),$$
then we can expand ψ in a perturbation series of corrections:

$$\psi(x) = \psi^{(0)}(x) + \psi^{(1)}(x) + \psi^{(2)}(x) + \dots,$$

where ψ^(1) is the 1st order perturbation, ψ^(2) is the 2nd order, ....


Now we can use ψ^(0)(x) as the source term, and use a method like Green functions, to get a better
approximation to ψ(x):

$$L\{\psi^{(0)} + \psi^{(1)}\} = \psi^{(0)} \;\Rightarrow\; \psi^{(0)}(x) + \psi^{(1)}(x) = \int \psi^{(0)}(x')\,G(x; x')\,dx', \qquad (4.16)$$

where G(x; x') is the Green function for L, i.e. L{G(x; x')} = δ(x − x').


ψ^(0)(x) + ψ^(1)(x) is called the first Born approximation of ψ(x). This process can be extended to arbitrarily
high accuracy.


In QM, L is the perturbed hamiltonian:

$$L = H_0 + V(\mathbf{r}),$$

where V(r) is "small" compared to H_0. ψ^(0) is an exact solution to the unperturbed Schrödinger equation, so
it can be shown that the Born approximation (4.16) reduces to:

$$\psi^{(1)}(x) = \int \psi^{(0)}(x')\,G(x; x')\,dx', \qquad \psi^{(2)}(x) = \int \psi^{(1)}(x')\,G(x; x')\,dx', \qquad\dots\qquad \psi^{(n+1)}(x) = \int \psi^{(n)}(x')\,G(x; x')\,dx'.$$

This process assumes that the Green function is "small" enough to produce a converging sequence. The first
Born approximation is valid when ψ^(1)(x) << ψ^(0)(x) everywhere, and in many other, less stringent but harder
to quantify, conditions. The extension to higher order approximations is straightforward: the Born
approximation is valid when ψ^(n+1)(x) << ψ^(n)(x). See Quirky Quantum Concepts for detailed information.


TBS: a real QM example? 


Green function as inverse operator?? 
5 Complex Analytic Functions


For a review of complex numbers and arithmetic, see Quirky Quantum Concepts. 


Notation: In this chapter, z, w are always complex variables; x, y, r, θ are always real variables. Other
variables are defined as used.


A complex function of a complex variable f(z) is analytic over some domain if it has an infinite number 
of continuous derivatives in that domain. It turns out, if f(z) is once differentiable on a domain, then it is 
infinitely differentiable, and therefore analytic on that domain. 


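A necessary condition for analyticity of f(z) = u(x, y) + iv(x, y) near z_0 is that the Cauchy-Riemann
equations hold, to wit:

$$\frac{\partial f}{\partial x} = -i\,\frac{\partial f}{\partial y} \;\Rightarrow\; \frac{\partial u}{\partial x} + i\,\frac{\partial v}{\partial x} = -i\left(\frac{\partial u}{\partial y} + i\,\frac{\partial v}{\partial y}\right) = -i\,\frac{\partial u}{\partial y} + \frac{\partial v}{\partial y} \;\Rightarrow\; \frac{\partial u}{\partial x} = \frac{\partial v}{\partial y}, \quad \frac{\partial v}{\partial x} = -\frac{\partial u}{\partial y}.$$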
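
The Cauchy-Riemann condition, in the compact form ∂f/∂x = −i ∂f/∂y, can be tested numerically with finite differences. The Python sketch below is illustrative only: the test point, the step size, and the function names are assumptions. It returns |∂f/∂x + i ∂f/∂y|, which should be near zero for an analytic function such as e^z, and is clearly nonzero for the non-analytic complex conjugate.

```python
import cmath

def partials(f, z, d=1e-6):
    """Central-difference estimates of df/dx and df/dy at z = x + i y."""
    df_dx = (f(z + d) - f(z - d)) / (2 * d)
    df_dy = (f(z + 1j * d) - f(z - 1j * d)) / (2 * d)
    return df_dx, df_dy

def cauchy_riemann_residual(f, z):
    """|df/dx + i df/dy|, which vanishes when the Cauchy-Riemann equations hold."""
    df_dx, df_dy = partials(f, z)
    return abs(df_dx + 1j * df_dy)

if __name__ == "__main__":
    z = 0.3 + 0.4j                                   # arbitrary test point
    print("exp(z):  ", cauchy_riemann_residual(cmath.exp, z))
    print("conj(z): ", cauchy_riemann_residual(lambda w: w.conjugate(), z))
```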


A sufficient condition for analyticity of f(z) = u(x, y) + iv(x, y) near z_0 is that the Cauchy-Riemann
equations hold, and the first partial derivatives of f exist and are continuous in a neighborhood of z_0. Note
that if the first derivative of a complex function is continuous, then all derivatives are continuous, and the
function is analytic. This condition implies:

$$\nabla^2 u = \nabla^2 v = 0,$$

$$\nabla u\cdot\nabla v = 0 \;\Rightarrow\; \text{"level lines" are perpendicular},$$

$$\int_{z_1}^{z_2} f(z)\,dz \ \text{ is contour independent if } f(z) \text{ is single-valued}.$$
zy 


Note that a function can be analytic in some regions, but not others. Singular points, or singularities,
are not in the domain of analyticity of the function, but border the domain [Det def 4.5.2 p156]. E.g., √z is
singular at 0, because it is not differentiable, but it is continuous at 0. Poles are singularities near which the
function is unbounded (infinite), but can be made finite by multiplication by (z − z_0)^k for some finite k [Det
p165]. This implies f(z) can be written as:

$$f(z) = a_{-k}(z - z_0)^{-k} + a_{-k+1}(z - z_0)^{-k+1} + \dots + a_{-1}(z - z_0)^{-1} + a_0 + a_1(z - z_0)^1 + \dots$$

The value k is called the order of the pole. All poles are singularities. Some singularities are like "poles"
of infinite order, because the function is unbounded near the singularity, but it is not a pole because it cannot
be made finite by multiplication by any (z − z_0)^k, for example e^(1/z) at z = 0. Such a singularity is called an essential
singularity.


A Laurent series expansion of a function is similar to a Taylor series expansion, but the sum runs from
−∞ to +∞, instead of from 1 to ∞. In both cases, an expansion is about some point, z_0:

$$\text{Taylor series:}\quad f(z) = f(z_0) + \sum_{n=1}^{\infty} b_n (z - z_0)^n, \quad\text{where}\quad b_n = \frac{f^{(n)}(z_0)}{n!}.$$

$$\text{Laurent series:}\quad f(z) = \sum_{n=-\infty}^{\infty} a_n (z - z_0)^n, \quad\text{where}\quad a_n = \frac{1}{2\pi i}\oint_{\text{around } z_0}\frac{f(z)}{(z - z_0)^{n+1}}\,dz.$$

[Det thm 4.6.1 p163] Analytic functions have Taylor series expansions about every point in the domain.
Taylor series can be thought of as special cases of Laurent series. But analytic functions also have Laurent
expansions about isolated singular points, i.e. the expansion point is not even in the domain of analyticity!
The Laurent series is valid in some annulus around the singularity, but not across branch cuts. Note that in
general, the a_n and b_n could be complex, but in practice, they are often real.


Properties of analytic functions: 
1. If it is differentiable once, it is infinitely differentiable.


2. The Taylor and Laurent expansions are unique. This means you may use any of several methods to 
find them for a given function. 


3. If you know a function and all its derivatives at any point, then you know the function everywhere 
in its domain of analyticity. This follows from the fact that every analytic function has a Laurent 
power series expansion. It implies that the value throughout a region is completely determined by 
its values at a boundary. 


4. A non-constant analytic function cannot have a local maximum of absolute value inside its domain,
and can have a local minimum of absolute value only where it is zero. (Why not??)


Residues 


Mostly, we use complex contour integrals to evaluate difficult real integrals, and to sum infinite series.
To evaluate contour integrals, we need to evaluate residues. Here, we introduce residues. The residue of a
complex function at a complex point z_0 is the a_(−1) coefficient of the Laurent expansion about the point z_0.
Residues of singular points are the only ones that interest us. (In fact, residues of branch points are not
defined [Sea sec 13.1].)


Common ways to evaluate residues 


1. The residue of a removable singularity is zero. This is because the function is bounded near the
singularity, and thus a_(−1) must be zero (or else the function would blow up at z_0):

$$\text{For } a_{-1} \ne 0, \text{ as } z \to z_0, \quad \left|\frac{a_{-1}}{z - z_0}\right| \to \infty \quad\Rightarrow\quad a_{-1} = 0.$$

2. The residue of a simple pole at z_0 (i.e., a pole of order 1) is

$$a_{-1} = \lim_{z\to z_0}(z - z_0)\,f(z).$$


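3. Extending the previous method: the residue of a pole at z_0 of order k is

$$a_{-1} = \frac{1}{(k-1)!}\lim_{z\to z_0}\frac{d^{k-1}}{dz^{k-1}}\left[(z - z_0)^k f(z)\right],$$

which follows by substitution of the Laurent series for f(z), and direct differentiation. We show it
here, noting that a pole of order k implies that a_n = 0 for n < −k, so we get:

$$f(z) = a_{-k}(z - z_0)^{-k} + a_{-k+1}(z - z_0)^{-k+1} + \dots + a_{-1}(z - z_0)^{-1} + a_0 + a_1(z - z_0)^1 + \dots$$

$$(z - z_0)^k f(z) = a_{-k} + a_{-k+1}(z - z_0)^1 + \dots + a_{-1}(z - z_0)^{k-1} + a_0(z - z_0)^k + a_1(z - z_0)^{k+1} + \dots$$

$$\frac{d^{k-1}}{dz^{k-1}}\left[(z - z_0)^k f(z)\right] = (k-1)!\,a_{-1} + \frac{k!}{1!}\,a_0(z - z_0) + \frac{(k+1)!}{2!}\,a_1(z - z_0)^2 + \dots$$

$$\lim_{z\to z_0}\frac{d^{k-1}}{dz^{k-1}}\left[(z - z_0)^k f(z)\right] = (k-1)!\,a_{-1} \quad\Rightarrow\quad a_{-1} = \frac{1}{(k-1)!}\lim_{z\to z_0}\frac{d^{k-1}}{dz^{k-1}}\left[(z - z_0)^k f(z)\right].$$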
4. If f(z) can be written as f(z) = P(z)/Q(z), where P is continuous at z_0, Q(z_0) = 0, and Q'(z_0) ≠ 0 (and is continuous
at z_0), then f(z) has a simple pole at z_0, and

$$\underset{z = z_0}{\mathrm{Res}}\, f(z) = \frac{P(z_0)}{\left.\frac{d}{dz}Q(z)\right|_{z_0}} = \frac{P(z_0)}{Q'(z_0)}.$$

Why? Near z_0, Q(z) ≈ (z − z_0)Q'(z_0). Then:

$$\underset{z = z_0}{\mathrm{Res}}\, f(z) = \lim_{z\to z_0}(z - z_0)\,f(z) = \lim_{z\to z_0}(z - z_0)\,\frac{P(z)}{(z - z_0)\,Q'(z_0)} = \frac{P(z_0)}{Q'(z_0)}.$$

5. Find the Laurent series, and hence its coefficient of (z − z_0)^(−1). This is sometimes easy if f(z) is given
in terms of functions with well-known power series expansions. See the sum of series example
later.
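
The residue formulas in the list above can be cross-checked against the defining contour integral, a_(−1) = (1/2πi)∮ f(z) dz, evaluated numerically on a small circle about z_0. The Python sketch below is a hedged illustration: the two test functions, the circle radius, the number of sample points, and the function name are assumptions, not part of the original text. For 1/(z² + 1) at z_0 = i (a simple pole), method 4 gives P/Q' = 1/(2i) = −i/2. For e^z/z³ at z_0 = 0 (a pole of order 3), method 3 gives (1/2!) d²(e^z)/dz² at 0, which is 1/2.

```python
import cmath
import math

def residue_contour(f, z0, radius=0.5, n=256):
    """Estimate (1 / (2 pi i)) * closed integral of f on a circle about z0.

    With z = z0 + R e^{i theta}, dz = i R e^{i theta} d(theta), so the integral
    reduces to the average of f(z) * R e^{i theta} over equally spaced theta.
    The circle must enclose no singularity other than z0.
    """
    total = 0j
    for j in range(n):
        w = cmath.exp(2j * math.pi * j / n)          # point on the unit circle
        total += f(z0 + radius * w) * radius * w
    return total / n

if __name__ == "__main__":
    simple = residue_contour(lambda z: 1.0 / (z * z + 1.0), 1j)
    triple = residue_contour(lambda z: cmath.exp(z) / z**3, 0.0)
    print("1/(z^2+1) at i:", simple, " formula P/Q':", 1.0 / 2j)
    print("e^z/z^3 at 0:  ", triple, " formula:", 0.5)
```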


We will include real-life examples of most of these as we go. 


Contour Integrals 


Contour integration is an invaluable tool for evaluating both real and complex-valued integrals. Contour 
integrals are used all over advanced physics, and we could not do physics as we know it today without them. 
Contour integrals are mostly useful for evaluating difficult ordinary (real-valued) integrals, and sums of 
series. In many cases, a function is analytic except at a set of distinct points. In this case, a contour integral 
may enclose, or pass near, some points of non-analyticity, i.e. singular points. It is these singular points that 
allow us to evaluate the integral. 


You often let the radius of the contour integral go to ∞ for some part of the contour:

[Figure: contour with a large arc $C_R$ in the complex plane; axes: real, imaginary.]

Any arc where

$$\lim_{R \to \infty} |f(z)| \sim \frac{1}{|z|^{1+\varepsilon}}, \qquad \varepsilon > 0,$$

has an integral of 0 over the arc.



Beware that this is often stated incorrectly as “any function which goes to zero faster than 1/|z| has a 
contour integral of 0.” The problem is that it has to have an exponent < —1; it is not sufficient to be 


simply smaller than 1/|z|. E.g. Fre < a , but the contour integral still diverges. 
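
The arc condition can be illustrated numerically. The sketch below (standard library only; `arc_integral` is a hypothetical helper, and the two integrands are chosen purely for illustration) integrates over the upper semicircle of radius R. With $|f| \sim 1/|z|^2$ the arc integral shrinks like 2/R, while with $|f| \sim 1/|z|$ it stays at iπ for every R.

```python
import cmath
import math

def arc_integral(f, radius, n=20000):
    """Integrate f(z) dz over the upper semicircle z = R e^{i theta},
    theta in [0, pi], using the midpoint rule in theta."""
    total = 0j
    dtheta = math.pi / n
    for k in range(n):
        z = radius * cmath.exp(1j * (k + 0.5) * dtheta)
        total += f(z) * 1j * z * dtheta   # dz = i z d(theta)
    return total

for R in (10.0, 100.0, 1000.0):
    fast = arc_integral(lambda z: 1 / z**2, R)   # exponent -2: arc integral -> 0
    slow = arc_integral(lambda z: 1 / z, R)      # exponent -1: arc integral = i*pi
    print(R, abs(fast), slow)
```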


Jordan’s lemma: ??. 


Evaluating Integrals 


Surprisingly, we can use complex contour integrals to evaluate difficult real integrals. The main idea is 
to find a contour which: 


(a) includes some known (possibly complex) multiple of the desired (real) integral, 
(b) includes other segments whose values are zero, and 
(c) includes a known set of poles whose residues can be found. 


Then you simply plug into the residue theorem:

$$\oint_C f(z)\, dz = 2\pi i \sum_{n\ \text{residues}} \operatorname*{Res}_{z = z_n} f(z), \qquad \text{where } z_n \text{ are the finite set of isolated singularities.}$$


We can see this by considering the contour integral around the unit circle for each term in the Laurent series
expanded about 0. First, consider the $z^0$ term (the constant term). We seek the value of $\oint_O dz$. dz is a small
complex number, representable as a vector in the complex plane. Figure 5.1a shows the geometric meaning
of dz. Figure 5.1b shows the geometric approximation to the desired integral.


Figure 5.1 (a) Geometric description of dz. (b) Approximation of $\oint_O dz$ as a sum of 32 small
complex terms (vectors).


We see that all the tiny dz elements add up to zero: the vectors add head-to-tail, and circle back to the starting
point. The sum vector (displacement from start) is zero. This is true for any large number of dz, so we have:

$$\oint_O dz = 0.$$

Next, consider the $z^{-1}$ term, $\oint_O \left(\frac{1}{z}\right) dz$, and a change of integration variable to θ:

$$\text{Let } z = e^{i\theta},\ dz = i e^{i\theta}\, d\theta: \qquad \oint_O \left(\frac{1}{z}\right) dz = \int_0^{2\pi} e^{-i\theta}\, i e^{i\theta}\, d\theta = \int_0^{2\pi} i\, d\theta = 2\pi i.$$


The change of variable maps the complex contour and z into an ordinary integral of a real variable. 


Geometrically, as z goes positively (counter-clockwise) around the unit circle (below left), $z^{-1}$ goes around
the unit circle in the negative (clockwise) direction (below middle). Its complex angle is arg(1/z) = −θ, where
$z = e^{i\theta}$. As z goes around the unit circle, dz has infinitesimal magnitude dθ, and argument θ + π/2. Hence,
the product (1/z) dz always has argument −θ + θ + π/2 = π/2; it is always purely imaginary.

[Figure: (Left) Path of $z = e^{i\theta}$ around the unit circle. (Middle) Path of $1/z = e^{-i\theta}$ around the unit circle. (Right) Path of $dz = i e^{i\theta}\, d\theta$ around the unit circle. Axes: real, imaginary.]

Paths of z, 1/z, and dz in the complex plane


The magnitude of (1/z) dz = dθ; thus the integral around the circle is 2πi. Multiplying the integrand by some
constant, $a_{-1}$ (the residue), just multiplies the integral by that constant. And any contour integral that encloses
the pole 1/z and no other singularity has the same value. Hence, for any contour around the origin

$$\oint_O a_{-1} z^{-1}\, dz = 2\pi i\, (a_{-1}) \quad\Rightarrow\quad a_{-1} = \frac{\oint_O a_{-1} z^{-1}\, dz}{2\pi i}.$$


Now consider the other terms of the Laurent expansion of f(z). We already showed that the $a_0 z^0$ term,
which on integration gives the product $a_0\, dz$, rotates uniformly about all directions, in the positive (counter-
clockwise) sense, and sums to zero. Hence the $a_0$ term contributes nothing to the contour integral.

The $a_1 z^1\, dz$ product rotates uniformly twice around all directions in the positive sense, and of course, still
sums to zero. Higher powers of z simply rotate more times, but always an integer number of times around
the circle, and hence always sum to zero.

Similarly, $a_{-2} z^{-2}$, and all more negative powers, rotate uniformly about all directions, but in the negative
(clockwise) sense. Hence, all these terms contribute nothing to the contour integral.
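
The head-to-tail vector sum described above is easy to reproduce numerically. This sketch (standard library only; the function name `unit_circle_integral` is an illustrative choice) adds up $z^n\, dz$ over many small steps around the unit circle for several powers n. Only n = −1 gives a nonzero total, 2πi.

```python
import cmath
import math

def unit_circle_integral(power, n=3600):
    """Sum z**power * dz over n small steps dz around the unit circle,
    counter-clockwise."""
    total = 0j
    dtheta = 2 * math.pi / n
    for k in range(n):
        z = cmath.exp(1j * k * dtheta)
        dz = 1j * z * dtheta          # dz = i e^{i theta} d(theta)
        total += z**power * dz
    return total

for power in range(-3, 4):
    print(power, unit_circle_integral(power))
```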


So in the end: 


The only term of the Laurent expansion about 0 that contributes to the contour integral is the residue
term, $a_{-1} z^{-1}$.

The simplest contour integral: Evaluate $I = \int_0^\infty \frac{1}{x^2 + 1}\, dx$.

We know from elementary calculus (let x = tan u) that I = π/2. We can find this easily from the residue
theorem, using the following contour:


[Figure: the contour C in the complex plane; axes: real, imaginary.]

“C” denotes a contour, and “I” denotes the integral over that contour. We let the radius of the arc go to
infinity, and we see that the closed contour integral $I_C = I + I + I_R$. But $I_R = 0$, because $|f(z)| \sim 1/R^2$ on the arc as $R \to \infty$. Then
$I = I_C/2$. f(z) has poles at ±i. The contour encloses one pole at i. Its residue is

$$\operatorname{Res} f(i) = \frac{1}{\left. \frac{d}{dz}\left(z^2 + 1\right) \right|_{z=i}} = \left. \frac{1}{2z} \right|_{z=i} = \frac{1}{2i}, \qquad I_C = 2\pi i \sum_n \operatorname{Res} f(z_n) = 2\pi i\, \frac{1}{2i} = \pi, \qquad I = \frac{I_C}{2} = \frac{\pi}{2}.$$

Note that when evaluating a real integral with complex functions and contour integrals, the i’s always 
cancel, and you get a real result, as you must. It’s a good check to make sure this happens. 
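
The check that the i’s cancel can be done mechanically. The sketch below (standard library only; function names are illustrative) evaluates the residue-theorem result for this integral in complex arithmetic and compares it with a direct midpoint-rule quadrature of the real integral. The tail beyond the cutoff is approximated by 1/cutoff, which is an assumption of this sketch based on the 1/x² falloff of the integrand.

```python
import math

def integral_by_residue():
    """I = I_C / 2, with I_C = 2*pi*i * Res f(i) and Res f(i) = 1/(2i)."""
    residue = 1 / (2j)
    closed_contour = 2j * math.pi * residue
    return closed_contour / 2

def integral_by_quadrature(upper=1000.0, n=200000):
    """Midpoint rule for 1/(x^2+1) on [0, upper], plus the approximate tail
    1/upper for the part from upper to infinity."""
    h = upper / n
    body = sum(1.0 / (((k + 0.5) * h) ** 2 + 1.0) for k in range(n)) * h
    return body + 1.0 / upper

result = integral_by_residue()
print(result.real, result.imag)        # imaginary part vanishes
print(integral_by_quadrature(), math.pi / 2)
```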


Choosing the Right Path: Which Contour? 


The path of integration is fraught with perils. How will I know which path to choose? There is no 
universal answer. Often, many paths lead to the same truth. Still, many paths lead nowhere. All we can do 
is use experience as our guide, and take one step in a new direction. If we end up where we started, we are 
grateful for what we learned, and we start anew. 


We here examine several useful and general, but oft neglected, methods of contour integration. We use
some sample problems to illustrate these tools. This section assumes a familiarity with contour integration,
and its use in evaluating definite integrals, including the residue theorem.

Example: Evaluate $I = \int_{-\infty}^{\infty} \frac{\sin^2 x}{x^2}\, dx$.

The integrand is everywhere nonnegative, and somewhere positive, and the limits are in the positive
direction, so I must be positive. We observe that the given integrand has no poles; it has only a removable
singularity at x = 0. If we are to use contour integrals, we must somehow create a pole (or a few), to use the
residue theorem. Simple poles (i.e. 1st-order) are sometimes best, because then we can also use the indented
contour theorem.


[Figure: two indented contours in the complex plane; axes: real, imaginary.]

Contours for the two exponential integrals: (left) positive (counter-clockwise) exp(2z);
(right) negative (clockwise) exp(−2z)


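To use a contour integral (which, a priori, may or may not be a good idea), we must do two things: (1)
create a pole; and (2) close the contour. The same method does both: expand the sin( ) in terms of
exponentials:

$$I = \int_{-\infty}^{\infty} \frac{\sin^2 x}{x^2}\, dx = \int_{-\infty}^{\infty} \frac{\left(e^{iz} - e^{-iz}\right)^2}{(2i)^2 z^2}\, dz = -\frac{1}{4} \left[ \int_{-\infty}^{\infty} \frac{e^{i2z}}{z^2}\, dz - \int_{-\infty}^{\infty} \frac{2}{z^2}\, dz + \int_{-\infty}^{\infty} \frac{e^{-i2z}}{z^2}\, dz \right].$$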


All three integrals on the RHS have poles at z = 0. If we indent the contour underneath the origin, then since
the function is bounded near there, the limit as r → 0 leaves the original integral unchanged (above left). The
first integral must be closed in the upper half-plane, to keep the exponential small. The second integral can
be closed in either half-plane, since it ~ $1/z^2$. The third integral must be closed in the lower half-plane, again
to keep the exponential small (above right). Note that all three contours must use an indentation that preserves
the value of the original integral. An easy way to ensure this is to use the same indentation on all three.


Now the third integral encloses no poles, so is zero. The 2nd integral, by inspection of its Laurent series,
has a residue of zero, so is also zero. Only the first integral contributes. By expanding the exponential in a
Taylor series, and dividing by $z^2$, we find its residue is 2i. Using the residue theorem, we have:

$$I = \int_{-\infty}^{\infty} \frac{\sin^2 x}{x^2}\, dx = -\frac{1}{4} \left[ 2\pi i\, (2i) \right] = \pi.$$
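
A direct numerical check of this result is sketched below (standard library only; the function name and the cutoff are illustrative choices). The integrand decays only like 1/x², so the sketch adds a tail estimate: beyond the cutoff, sin² averages to 1/2, so each tail contributes about 1/(2·limit). That tail estimate is an assumption of this sketch, good to roughly 1/limit² here.

```python
import math

def sinc_squared_integral(limit=200.0, n=400000):
    """Midpoint rule for sin(x)^2 / x^2 over [-limit, limit], plus a tail
    estimate of 1/(2*limit) for each of the two tails."""
    h = 2 * limit / n
    total = 0.0
    for k in range(n):
        x = -limit + (k + 0.5) * h   # midpoints never land exactly on x = 0
        total += (math.sin(x) / x) ** 2
    return total * h + 2 * (1.0 / (2 * limit))

# Residue-theorem value: -(1/4) [2 pi i (2i)]
by_residue = (-0.25 * (2j * math.pi) * 2j).real
print(sinc_squared_integral(), by_residue, math.pi)
```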


Example: Evaluate $I = \int_0^\infty \frac{\cos(ax) - \cos(bx)}{x^2}\, dx$ [B&C p?? QI].

This innocent looking problem has a number of funky aspects:

• The integrand is two terms. Separately, each term diverges. Together, they converge.

• The integrand is even, so if we choose a contour that includes the whole real line, the contour integral
includes twice the integral we seek (twice I).

• The integrand has no poles. How can we use any residue theorems if there are no poles? Amazingly,
we can create a useful pole.

• A typical contour includes an arc at infinity, but cos(z) is ill-behaved for z far off the real-axis. How
can we tame it?

• We will see that this integral leads to the indented contour theorem, which can only be applied to
simple poles, i.e., first order poles (unlike the residue theorem, which applies to all poles).

Each of these funky features is important, and each arises in practical real-world integrals. Let us consider
each funkiness in turn.


1. The integrand is two terms. Separately, each term diverges. Together, they converge.
Near zero, cos(x) ~ 1. Therefore, the zero endpoint of either term of the integral looks like

$$\int_0^{\text{anywhere}} \frac{\cos ax}{x^2}\, dx \sim \int_0^{\text{anywhere}} \frac{1}{x^2}\, dx = \left. -\frac{1}{x} \right|_0^{\text{anywhere}} \to +\infty.$$

Thus each term, separately, diverges. However, the difference is finite. We see this by power series
expanding cos(x):

$$\cos(ax) - \cos(bx) = \left(1 - \frac{a^2 x^2}{2}\right) - \left(1 - \frac{b^2 x^2}{2}\right) + O(x^4) = \frac{b^2 - a^2}{2}\, x^2 + O(x^4) \quad\Rightarrow$$

$$\int_0^{\text{anywhere}} \frac{\cos(ax) - \cos(bx)}{x^2}\, dx \sim \int_0^{\text{anywhere}} \frac{b^2 - a^2}{2}\, dx, \quad \text{which is to say, is finite.}$$

2. The integrand is even, so if we choose a contour that includes the whole real line, the contour
integral includes twice the integral we seek (twice I).

Perhaps the most common integration contour (below left) covers the real line, and an infinitely distant
arc from +∞ back to −∞. When our real integral (I in this case) is only from 0 to ∞, the contour integral
includes more than we want on the real axis. If our integrand is even, the contour integral includes twice the
integral we seek (twice I). This may seem trivial, but the point to notice is that when integrating from
−∞ to 0, dx is still positive (below middle).

[Figure: a contour in the complex plane, and the graph of an even function f(x).]

(Left) A common contour.
(Right) An even function has integral over the real-line twice that of 0 to infinity.


Note that if the integrand is odd (below left), choosing this contour cancels out the original (real) integral 
from our contour integral, and the contour is of no use. Or if the integrand has no even/odd symmetry (below 
middle), then this contour tells us nothing about our desired integral. In these cases, a different contour may 
work, for example, one which only includes the positive real axis (below right). 


[Figure: graph of an odd f(x); graph of an asymmetric f(x); a contour with dx > 0 along the positive real axis.]

(Left) An odd function has zero integral over the real line. (Middle) An asymmetric function has
unknown integral over the real line. (Right) A contour containing only the desired real integral.


3. The integrand has no poles. How can we use any residue theorems if there are no poles? 
Amazingly, we can create a useful pole. 


This is the funkiest aspect of this problem, but illustrates a standard tool. We are given a real-valued 
integral with no poles. Contour integration is usually useless without a pole, and a residue, to help us evaluate 
the contour integral. Our integrand contains cos(x), and that is related to exp(ix). We could try replacing 
cosines with exponentials, 

$$\cos z = \frac{\exp(iz) + \exp(-iz)}{2} \qquad \text{(does no good)},$$


but this only rearranges the algebra; fundamentally, it buys us nothing. The trick here is to notice that we 
can often add a made-up imaginary term to our original integrand, perform a contour integration, and then 
simply take the real part of our result: 


$$\text{Given } I = \int_a^b g(x)\, dx, \text{ let } f(z) \equiv g(z) + i h(z). \quad \text{Then } I = \operatorname{Re}\left\{ \int_a^b f(z)\, dz \right\}.$$


For this trick to work, ih(z) must have no real-valued contribution over the contour we choose, so it 
doesn’t mess up the integral we seek. Often, we satisfy this requirement by choosing ih(z) to be purely 
imaginary on the real axis, and having zero contribution elsewhere on the contour. Given an integrand 
containing cos(x), as in our example, a natural choice for ih(z) is i sin(z), because then we can write the new 
integrand as a simple exponential: 


$$\cos(x) \to f(z) \equiv \cos(z) + i \sin(z) = \exp(iz).$$


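In our example, the corresponding substitution yields

$$I = \int_0^\infty \frac{\cos ax - \cos bx}{x^2}\, dx \quad\Rightarrow\quad I = \operatorname{Re}\left\{ \int_0^\infty \frac{\exp(iax) - \exp(ibx)}{x^2}\, dx \right\}.$$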


Examining this substitution more closely, we find a wonderful consequence: this substitution introduced 
a pole! Recall that 


4/27/2021 11:49AM — Copyright 2002-2021 Eric L. Michelsen. All rights reserved. 72 of 322 


elmichelsen.physics.ucsd.edu/ | Funky Mathematical Physics Concepts emichels at physics.ucsd.edu 


$$\sin z = z - \frac{z^3}{3!} + \dots \quad\Rightarrow\quad \frac{i \sin z}{z^2} = i \left( \frac{1}{z} - \frac{z}{3!} + \dots \right).$$


We now have a simple pole at z = 0, with residue i. 


By choosing to add an imaginary term to the integrand, we now have a pole that we can work with to 
evaluate a contour integral! 


It’s like magic. In our example integral, our residue is:

$$\frac{i \sin az - i \sin bz}{z^2} = i \left( \frac{a - b}{z} + \dots \right), \quad \text{and residue} = i(a - b).$$


Note that if our original integrand contained sin(x) instead of cos(x), we would have made a similar 
substitution, but taken the imaginary part of the result: 


$$\text{Given } I = \int_a^b \sin(x)\, dx, \text{ let } f(z) \equiv \cos(z) + i \sin(z). \quad \text{Then } I = \operatorname{Im}\left\{ \int_a^b f(z)\, dz \right\}.$$


4. A typical contour includes an arc at infinity, but cos(z) is ill-behaved for z far off the real-axis. 
How can we tame it? 


This is related to the previous funkiness. We’re used to thinking of cos(x) as a nice, bounded, well- 
behaved function, but this is only true when x is real. 


When integrating cos(z) over a contour, we must remember that 
cos(z) blows up rapidly off the real axis. 


In fact, |cos(z)| ~ exp(|Im{z}|), so it blows up extremely quickly off the real axis. If we’re going to evaluate
a contour integral with cos(z) in it, we must cancel its divergence off the real axis. There is only one function
which can exactly cancel the divergence of cos(z), and that is ± i sin(z). The plus sign cancels the divergence
above the real axis; the minus sign cancels it below. There is nothing that cancels it everywhere. We show
this cancellation simply:

$$\text{Let } z \equiv x + iy: \qquad \cos z + i \sin z = \exp(iz) = \exp\big(i(x + iy)\big) = \exp(ix) \exp(-y), \quad\text{and}$$

$$\left| \exp(ix) \exp(-y) \right| = \left| \exp(ix) \right| \cdot \left| \exp(-y) \right| = \exp(-y).$$

For z above the real axis, this shrinks rapidly. Recall that in the previous step, we added i sin(x) to our
integrand to give us a pole to work with. We see now that we also need the same additional term to tame the
divergence of cos(z) off the real axis. For the contour we’ve chosen, no other term will work.
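
A few numbers make the contrast concrete. This sketch (standard library only; the function names and sample points are illustrative) prints |cos z| and |exp(iz)| at increasing heights y above the real axis, alongside exp(−y).

```python
import cmath
import math

def cos_magnitude(x, y):
    """|cos(x + iy)|, which grows like exp(|y|)/2 off the real axis."""
    return abs(cmath.cos(complex(x, y)))

def exp_iz_magnitude(x, y):
    """|exp(i(x + iy))| = exp(-y): small above the real axis, large below it."""
    return abs(cmath.exp(1j * complex(x, y)))

for y in (0.0, 5.0, 10.0, 20.0):
    print(y, cos_magnitude(1.0, y), exp_iz_magnitude(1.0, y), math.exp(-y))
```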


5. We will see that this integral leads to the indented contour theorem, which can only be applied 
to simple poles, i.e., first order poles (unlike the residue theorem, which applies to all poles). 


We’re now at the final step. We have a pole at z = 0, but it is right on our contour, not inside it. If the 
pole were inside the contour, we would use the residue theorem to evaluate the contour integral, and from 
there, we'd find the integral on the real axis, cut it in half, and take the real part. That is the integral we seek. 


But the pole is not inside the contour; it is on the contour. The indented contour theorem allows us to
work with poles on the contour. We explain the theorem geometrically in the next section, but state it briefly
here:

$$\text{As } \rho \to 0, \qquad \int_{\text{arc}} f(z)\, dz = (i\theta)\,(\text{residue}).$$

[Figure: an arc of radius ρ and angle θ about a pole on the real axis; axes: real, imaginary.]

(Left) A tiny arc around a simple pole. (Right) A magnified view; we let ρ → 0.

Note that if we encircle the pole completely, θ = 2π, and we have the special case of the residue theorem for
a simple pole:

$$\oint f(z)\, dz = 2\pi i\,(\text{residue}).$$


However, the residue theorem is true for all poles, not just simple ones (see The Residue Theorem earlier). 
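
The restriction to simple poles can be seen numerically. In this sketch (standard library only; `small_arc_integral` and the two sample integrands are illustrative assumptions), a half-circle arc around the simple pole of exp(iz)/z approaches iπ times the residue as ρ shrinks, while the same arc around the double pole of 1/z² grows like 2/ρ.

```python
import cmath
import math

def small_arc_integral(f, z0, rho, angle, n=2000):
    """Integrate f(z) dz counter-clockwise over an arc of radius rho about z0,
    from polar angle 0 to `angle` (midpoint rule)."""
    total = 0j
    dphi = angle / n
    for k in range(n):
        offset = rho * cmath.exp(1j * (k + 0.5) * dphi)
        total += f(z0 + offset) * 1j * offset * dphi   # dz = i (z - z0) d(phi)
    return total

simple = lambda z: cmath.exp(1j * z) / z   # simple pole at 0, residue 1
double = lambda z: 1 / z**2                # double pole at 0, residue 0

for rho in (1e-1, 1e-2, 1e-3):
    print(rho,
          small_arc_integral(simple, 0, rho, math.pi),   # -> i*pi*(residue)
          small_arc_integral(double, 0, rho, math.pi))   # grows like 2/rho
```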


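Putting it all together: We now solve the original integral using all of the above methods. First, we
add i sin(z) to the integrand, which is equivalent to replacing cos(z) with exp(iz):

$$I = \int_0^\infty \frac{\cos ax - \cos bx}{x^2}\, dx = \operatorname{Re}\left\{ \int_0^\infty \frac{\exp(iax) - \exp(ibx)}{x^2}\, dx \right\}.$$

$$\text{Define } J \equiv \int_0^\infty \frac{\exp(iax) - \exp(ibx)}{x^2}\, dx, \quad \text{so } I = \operatorname{Re}\{J\}.$$

We choose the contour shown below left, with R → ∞, and ρ → 0.

[Figure: two contours in the complex plane (left and right panels); axes: real, imaginary.]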


There are no poles enclosed, so the contour integral is zero. The contour includes twice the desired integral,
so define:

$$f(z) \equiv \frac{\exp(iaz) - \exp(ibz)}{z^2}. \quad \text{Then } \oint f(z)\, dz = \int_{C_R} f(z)\, dz + 2J + \int_{C_\rho} f(z)\, dz = 0. \qquad (5.1)$$


For $C_R$, $|f(z)| \sim 1/R^2$, so as R → ∞, the integral goes to 0. For $C_\rho$, the residue is i(a − b), and the arc is π
radians in the negative direction, so the indented contour theorem says:

$$\lim_{\rho \to 0} \int_{C_\rho} f(z)\, dz = -(\pi i)\, i (a - b) = \pi (a - b).$$

Plugging into (5.1), we finally get

$$2J + \pi (a - b) = 0 \quad\Rightarrow\quad I = \operatorname{Re}\{J\} = \frac{\pi}{2} (b - a).$$
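
A direct quadrature confirms the result. This sketch (standard library only; the function name, cutoff, and the sample values a = 1, b = 2 are illustrative, and a, b > 0 is assumed) integrates the original real integrand with the midpoint rule. The neglected tail beyond the cutoff is of order 1/limit², so agreement is expected only to about 10⁻⁴.

```python
import math

def cosine_difference_integral(a, b, limit=200.0, n=400000):
    """Midpoint rule for (cos(ax) - cos(bx)) / x^2 on (0, limit].
    The integrand tends to (b^2 - a^2)/2 as x -> 0, so there is no singularity."""
    h = limit / n
    total = 0.0
    for k in range(n):
        x = (k + 0.5) * h
        total += (math.cos(a * x) - math.cos(b * x)) / (x * x)
    return total * h

a, b = 1.0, 2.0
print(cosine_difference_integral(a, b), math.pi / 2 * (b - a))
```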

In this example, the contour integral J happened to be real, so taking I = Re{J} is trivial, but in general,
there’s no reason why J must be real. It could well be complex, and we would need to take the real part of
it.

To illustrate this and more, we evaluate the integral again, now with the alternate contour shown above
right. Again, there are no poles enclosed, so the contour integral is zero. Again, the integral over $C_R$ = 0.
We then have:

$$\oint f(z)\, dz = \int_{C_R} f(z)\, dz + \int_{C_2} f(z)\, dz + J + \int_{C_\rho} f(z)\, dz = 0.$$

$$\text{And } \lim_{\rho \to 0} \int_{C_\rho} f(z)\, dz = -(i\pi/2)\, i (a - b) = \frac{\pi}{2} (a - b).$$


The integral over $C_2$ is down the imaginary axis:

$$\text{Let } z = x + iy = 0 + iy = iy, \text{ then } dz = i\, dy.$$

$$\int_{C_2} f(z)\, dz = \int_{C_2} \frac{\exp(iaz) - \exp(ibz)}{z^2}\, dz = \int_\infty^0 \frac{\exp(-ay) - \exp(-by)}{-y^2}\, i\, dy.$$

We don’t know what this integral is, but we don’t care! In fact, it is divergent, but we see that it is purely
imaginary, so will contribute only to the imaginary part of J. But we seek I = Re{J}, and therefore

$$I = \lim_{\rho \to 0} \operatorname{Re}\{J\} \text{ is well-defined.}$$

Therefore we ignore the divergent imaginary contribution from $C_2$. We then have

$$i\,(\text{something}) + J + \frac{\pi}{2}(a - b) = 0 \quad\Rightarrow\quad I = \operatorname{Re}\{J\} = \frac{\pi}{2}(b - a),$$

as before.


Evaluating Infinite Sums 


Perhaps the simplest infinite sum in the world is $S = \sum_{n=1}^{\infty} \frac{1}{n^2}$. The general method for using contour
integrals is to find a countably infinite set of residues whose values are the terms of the sum, and whose
contour integral can be evaluated by other means. Then

$$I_C = 2\pi i \sum_{n=1}^{\infty} \operatorname{Res} f(z_n) = 2\pi i\, S \quad\Rightarrow\quad S = \frac{I_C}{2\pi i}.$$


The hard part is finding the function f(z) that has the right residues. Such a function must first have poles at 
all the integers, and then also have residues at those poles equal to the terms of the series. 


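To find such a function, consider the complex function π cot(πz). Clearly, this has poles at all real integer
z, due to the sin(πz) function in the denominator of cot(πz). Hence,

$$\text{For } z_n = n \text{ (integer)}, \qquad \operatorname*{Res}_{z = z_n} \big[ \pi \cot(\pi z) \big] = \operatorname*{Res}_{z = z_n} \left[ \frac{\pi \cos(\pi z)}{\sin(\pi z)} \right] = \frac{\pi \cos(\pi z_n)}{\pi \cos(\pi z_n)} = 1,$$

where in the last step we used: if $Q(z_0) = 0$ then $\operatorname*{Res}_{z = z_0} \frac{P(z)}{Q(z)} = \frac{P(z_0)}{Q'(z_0)}$, if this is defined.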


Thus π cot(πz) can be used to generate lots of infinite sums, by simply multiplying it by a continuous
function of z that equals the terms of the infinite series when z is integer. For example, for the sum above,
$S = \sum_{n=1}^{\infty} \frac{1}{n^2}$, we simply define:

$$f(z) \equiv \frac{1}{z^2}\, \pi \cot(\pi z), \quad \text{and its residues are } \operatorname{Res} f(z_n) = \frac{1}{n^2}, \quad n \ne 0.$$

[In general, to find $\sum_{n=1}^{\infty} s(n)$, define

$$f(z) \equiv s(z) \big[ \pi \cot(\pi z) \big], \quad \text{and its residues are } \operatorname*{Res}_{z = n} f(z) = s(n).$$

However, now you may have to deal with the residues for n ≤ 0.]


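Continuing our example, now we need the residue at n = 0. Since cot(z) has a simple pole at zero,
cot(z)/z² has a 3rd order pole at zero. We optimistically try tedious brute force for an m-th order pole with m =
3, only to find that it fails:

$$\operatorname*{Res}_{z=0} \frac{\pi \cot \pi z}{z^2} = \lim_{z \to 0} \frac{1}{2!} \frac{d^2}{dz^2} \left[ z^3\, \frac{\pi \cot \pi z}{z^2} \right] = \lim_{z \to 0} \frac{1}{2!} \frac{d^2}{dz^2} \big[ \pi z \cot \pi z \big]$$

$$= \frac{\pi}{2} \lim_{z \to 0} \frac{d}{dz} \left[ \cot \pi z - \pi z \csc^2 \pi z \right] = \frac{\pi}{2} \lim_{z \to 0} \frac{d}{dz} \left[ \frac{\cos \pi z \sin \pi z - \pi z}{\sin^2 \pi z} \right] = \frac{\pi}{2} \lim_{z \to 0} \frac{d}{dz} \left[ \frac{\frac{1}{2} \sin 2\pi z - \pi z}{\sin^2 \pi z} \right].$$

$$\text{Use } d\,\frac{U}{V} = \frac{V\, dU - U\, dV}{V^2}:$$

$$\operatorname*{Res}_{z=0} \frac{\pi \cot \pi z}{z^2} = \frac{\pi}{2} \lim_{z \to 0} \frac{\sin^2 \pi z\, (\pi \cos 2\pi z - \pi) - \left( \frac{1}{2} \sin 2\pi z - \pi z \right) 2\pi \sin \pi z \cos \pi z}{\sin^4 \pi z}$$

$$= \frac{\pi}{2} \lim_{z \to 0} \frac{\sin \pi z\, (\pi \cos 2\pi z - \pi) - \left( \frac{1}{2} \sin 2\pi z - \pi z \right) 2\pi \cos \pi z}{\sin^3 \pi z}.$$

Use L’Hopital’s rule:

$$\operatorname*{Res}_{z=0} \frac{\pi \cot \pi z}{z^2} = \frac{\pi}{2} \lim_{z \to 0} \frac{\pi \cos \pi z\, (\pi \cos 2\pi z - \pi) + \sin \pi z\, (-2\pi^2 \sin 2\pi z) - (\pi \cos 2\pi z - \pi)\, 2\pi \cos \pi z + \left( \frac{1}{2} \sin 2\pi z - \pi z \right) 2\pi^2 \sin \pi z}{3\pi \sin^2 \pi z \cos \pi z}$$

$$= \frac{\pi}{2} \lim_{z \to 0} \frac{-\pi^2 \cos \pi z\, (\cos 2\pi z - 1) - 2\pi^2 \sin \pi z \sin 2\pi z + 2\pi^2 \left( \frac{1}{2} \sin 2\pi z - \pi z \right) \sin \pi z}{3\pi \sin^2 \pi z \cos \pi z}.$$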

At this point, we give up on brute force, because we see from the denominator that we’ll have to use 
L’Hopital’s rule twice more to eliminate the zero there, and the derivatives will get untenably complicated. 


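But in 2 lines, we can find the $a_{-1}$ term of the Laurent series from the series expansions of sin and cos.
The $z^1$ coefficient of cot(z) becomes the $z^{-1}$ coefficient of f(z) = cot(z)/z²:

$$\cot z = \frac{\cos z}{\sin z} = \frac{1 - z^2/2 + \dots}{z - z^3/6 + \dots} \approx \left( \frac{1}{z} \right) \frac{1 - z^2/2}{1 - z^2/6} \approx \left( \frac{1}{z} \right) \left( 1 - z^2/2 \right) \left( 1 + z^2/6 \right) \approx \left( \frac{1}{z} \right) \left( 1 - z^2/3 \right) = \frac{1}{z} - \frac{z}{3}$$

$$\cot \pi z \approx \frac{1}{\pi z} - \frac{\pi z}{3} \quad\Rightarrow\quad \operatorname*{Res}_{z=0} \frac{\pi \cot \pi z}{z^2} = -\frac{\pi^2}{3}.$$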


Now we take a contour integral over a circle centered at the origin: (no good, because cot(πz) blows up at
every integer ! ??)

[Figure: contour in the complex plane with poles along the real axis; axes: real, imaginary.]


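As R → ∞, $I_C$ → 0. Hence:

$$\oint_C f(z)\, dz = 2\pi i \left[ \sum_{n=-\infty}^{-1} \frac{1}{n^2} + K_0 + \sum_{n=1}^{\infty} \frac{1}{n^2} \right] = 2\pi i \left( K_0 + 2 \sum_{n=1}^{\infty} \frac{1}{n^2} \right) = 0 \quad\Rightarrow\quad \sum_{n=1}^{\infty} \frac{1}{n^2} = -\frac{K_0}{2} = \frac{\pi^2}{6},$$

where $K_0 = -\pi^2/3$ is the residue at z = 0.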
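
The bookkeeping in this sum can be checked numerically. The sketch below (standard library only) assumes a circular contour of radius N + 1/2, chosen here so that it passes midway between the poles at the integers; that choice of radius, like the helper names, is an assumption of this sketch rather than something fixed by the text. It verifies that the contour integral equals 2πi times the sum of the enclosed residues, and that the integral shrinks as N grows.

```python
import cmath
import math

def f(z):
    """f(z) = pi cot(pi z) / z^2, with residue 1/n^2 at each nonzero integer n."""
    return math.pi * cmath.cos(math.pi * z) / cmath.sin(math.pi * z) / z**2

def circle_integral(radius, n=20000):
    """Trapezoid-rule contour integral of f around |z| = radius, counter-clockwise."""
    total = 0j
    dtheta = 2 * math.pi / n
    for k in range(n):
        z = radius * cmath.exp(1j * k * dtheta)
        total += f(z) * 1j * z * dtheta   # dz = i z d(theta)
    return total

K0 = -math.pi**2 / 3   # residue at z = 0, from the Laurent series of cot
for N in (10, 40):
    partial = sum(1.0 / m**2 for m in range(1, N + 1))
    I_C = circle_integral(N + 0.5)
    # Enclosed residues: K0 plus 1/n^2 for n = -N..-1 and n = 1..N.
    print(N, abs(I_C), I_C / (2j * math.pi), K0 + 2 * partial)
print(-K0 / 2, math.pi**2 / 6)
```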


Multi-valued Functions 


Many functions are multi-valued (despite the apparent oxymoron), i.e. for a single point in the domain, 
the function can have multiple values. A simple example is a square-root function: given a complex number, 
there are two complex square roots of it. Thus, the square root function is two-valued. Another example is 
arc-tangent: given any complex number, there are an infinite number of complex numbers whose tangent is 
the given complex number. 


[picture??] 


We refer now to “nice” functions, which are locally (i.e., within any small finite region) analytic, but 
multi-valued. If you’re not careful, such “multi-valuedness” can violate the assumptions of analyticity, by 
introducing discontinuities in the function. Without analyticity, all our developments break down: no contour 
integrals, no sums of series. But, you can avoid such a breakdown, and preserve the tools we’ve developed, 
by treating multi-valued functions in a slightly special way to insure continuity, and therefore analyticity. 


A regular function, or region, is analytic and single valued. (You can get a regular function from a 
multi-valued one by choosing a Riemann sheet. More below.) 


A branch point is a point in the domain of a function f(z) with this property: when you traverse a closed 
path around the branch point, following continuous values of f(z), f(z) has a different value at the end point 
of the path than at the beginning point, even though the beginning and end point are the same point in the 
domain. Example TBS: square root around the origin. Sometimes branch points are also singularities. 
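
As one possible illustration of a branch point (a sketch using only the standard library; the function name and step count are arbitrary choices), the code below follows w = √z continuously while z makes closed loops around the origin. After one loop the value has changed sign, even though z has returned to its starting point; after two loops it is back where it started.

```python
import cmath
import math

def continue_sqrt_around(radius=1.0, turns=1, steps=1000):
    """Follow w = sqrt(z) continuously as z = r e^{i theta} circles the origin.
    At each step, keep whichever of the two roots is closer to the previous value."""
    w = cmath.sqrt(radius)                 # start on the branch with w > 0
    for k in range(1, turns * steps + 1):
        z = radius * cmath.exp(2j * math.pi * k / steps)
        root = cmath.sqrt(z)               # principal root; the other root is -root
        w = root if abs(root - w) <= abs(-root - w) else -root
    return w

print(continue_sqrt_around(turns=1))   # close to -1
print(continue_sqrt_around(turns=2))   # close to +1
```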


A branch cut is an arbitrary (possibly curved) path connecting branch points, or running from a branch 
point to infinity (“connecting” the branch point to infinity). If you now evaluate integrals of contours that 
never cross the branch cuts, you insure that the function remains continuous (and thus analytic) over the 
domain of the integral. 


This solves the problem of integrating across discontinuities. Branch cuts are like fences in the domain 
of the function: your contour integral can’t cross them. Note that you’re free to choose your branch cuts 
wherever you like, so long as the function remains continuous when you don’t cross the branch cuts. 
Connecting branch points is one way to insure this. 


A Riemann sheet is the complex plane plus a choice of branch cuts, and a choice of branch. This defines
a domain on which a function is regular.

A Riemann surface is a continuous joining of Riemann sheets, gluing the edges together. This “looks
like” sheets layered on top of each other, and each sheet represents one of the multiple values a multi-valued
analytic function may have. TBS: consider (z - a)(z - b) .


[Figure: two diagrams of the complex plane showing branch cuts; axes: real, imaginary.]

6 Conceptual Linear Algebra


Instead of lots of summation signs, we describe linear algebra concepts, visualizations, and ways to think 
about linear operations as algebraic operations. This allows fast understanding of linear algebra methods that 
is extremely helpful in almost all areas of physics. Tensors rely heavily on linear algebra methods, so this 
section is a good warm-up for tensors. Matrices and linear algebra are also critical for quantum mechanics. 


Caution In this section, vector means a column or row of numbers. In other sections, “vector” has 
a more general meaning. 


In this section, we use bold capitals for matrices (A), and bold lower-case for vectors (a). 


Matrix Multiplication 


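It is often helpful to view a matrix as a horizontal concatenation of column-vectors. You can think of it
as a row-vector, where each element of the row-vector is itself a column vector.

$$\mathbf{A} = \begin{bmatrix} \mathbf{a} & \mathbf{b} & \mathbf{c} \end{bmatrix} \quad\text{or}\quad \mathbf{A} = \begin{bmatrix} \mathbf{d} \\ \mathbf{e} \\ \mathbf{f} \end{bmatrix}$$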
Equally valid, you can think of a matrix as a vertical concatenation of row-vectors, like a column-vector 
where each element is itself a row-vector. 


Matrix multiplication is defined to be the operation of linear transformation, e.g., from one set of 
coordinates to another. The following properties follow from the standard definition of matrix multiplication: 


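Matrix times a vector: A matrix B times a column vector v, is a weighted sum of the columns of B:

$$\mathbf{Bv} = v^1 \begin{bmatrix} B_{11} \\ B_{21} \\ B_{31} \end{bmatrix} + v^2 \begin{bmatrix} B_{12} \\ B_{22} \\ B_{32} \end{bmatrix} + v^3 \begin{bmatrix} B_{13} \\ B_{23} \\ B_{33} \end{bmatrix}.$$

We can visualize this by laying the vector on its side above the columns of the matrix, multiplying each
matrix-column by the vector component, and summing the resulting vectors:

$$\mathbf{Bv} = \begin{bmatrix} B_{11} & B_{12} & B_{13} \\ B_{21} & B_{22} & B_{23} \\ B_{31} & B_{32} & B_{33} \end{bmatrix} \begin{bmatrix} v^1 \\ v^2 \\ v^3 \end{bmatrix} = \begin{bmatrix} v^1 B_{11} \\ v^1 B_{21} \\ v^1 B_{31} \end{bmatrix} + \begin{bmatrix} v^2 B_{12} \\ v^2 B_{22} \\ v^2 B_{32} \end{bmatrix} + \begin{bmatrix} v^3 B_{13} \\ v^3 B_{23} \\ v^3 B_{33} \end{bmatrix} = v^1 \begin{bmatrix} B_{11} \\ B_{21} \\ B_{31} \end{bmatrix} + v^2 \begin{bmatrix} B_{12} \\ B_{22} \\ B_{32} \end{bmatrix} + v^3 \begin{bmatrix} B_{13} \\ B_{23} \\ B_{33} \end{bmatrix}.$$

The columns of B are the vectors which are weighted by each of the input vector components, $v^j$.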


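Another important way of conceptualizing a matrix times a vector: the resultant vector is a column of
dot products. The i-th element of the result is the dot product of the given vector, v, with the i-th row of B.

Writing B as a column of row-vectors:

$$\mathbf{B} = \begin{bmatrix} \mathbf{r}_1 \\ \mathbf{r}_2 \\ \mathbf{r}_3 \end{bmatrix} \quad\Rightarrow\quad \mathbf{Bv} = \begin{bmatrix} \mathbf{r}_1 \\ \mathbf{r}_2 \\ \mathbf{r}_3 \end{bmatrix} \mathbf{v} = \begin{bmatrix} \mathbf{r}_1 \cdot \mathbf{v} \\ \mathbf{r}_2 \cdot \mathbf{v} \\ \mathbf{r}_3 \cdot \mathbf{v} \end{bmatrix}.$$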

This view derives from the one above, where we lay the vector on its side above the matrix, but now consider 
the effect on each row separately: it is exactly that of a dot product. 


In linear algebra, even if the matrices are complex, we do not conjugate the left vector in these dot 
products. If they need conjugation, the application must conjugate them separately from the matrix 
multiplication, i.e. during the construction of the matrix. 


We use this dot product concept later when we consider a change of basis. 
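
The two views of a matrix times a vector can be written as two small functions that give the same answer. This is a minimal sketch in plain Python (lists of rows; the function names and the sample matrix are illustrative, not from the text).

```python
def matvec_by_columns(B, v):
    """B v as a weighted sum of the columns of B, with weights v[j]."""
    n_rows = len(B)
    result = [0] * n_rows
    for j, weight in enumerate(v):
        column = [B[i][j] for i in range(n_rows)]
        result = [r + weight * c for r, c in zip(result, column)]
    return result

def matvec_by_rows(B, v):
    """B v as a column of dot products: element i is (row i of B) . v,
    with no complex conjugation applied."""
    return [sum(b * x for b, x in zip(row, v)) for row in B]

B = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 10]]
v = [1, 0, -1]
print(matvec_by_columns(B, v), matvec_by_rows(B, v))
```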


Matrix times a matrix: Multiplying a matrix B times another matrix C is defined as multiplying each 
column of C by the matrix B. Therefore, by definition, matrix multiplication distributes to the right across 
the columns: 


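$$\text{Let } \mathbf{C} = \begin{bmatrix} \mathbf{x} & \mathbf{y} & \mathbf{z} \end{bmatrix}, \text{ then } \mathbf{BC} = \mathbf{B} \begin{bmatrix} \mathbf{x} & \mathbf{y} & \mathbf{z} \end{bmatrix} = \begin{bmatrix} \mathbf{Bx} & \mathbf{By} & \mathbf{Bz} \end{bmatrix}.$$

[Matrix multiplication also distributes to the left across the rows, but we don’t use that as much.]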
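
The column-by-column definition translates directly into code. A minimal sketch in plain Python (matrices as lists of rows; `matvec` and `matmul_by_columns` are illustrative names):

```python
def matvec(B, v):
    """Matrix times column vector, as a column of dot products."""
    return [sum(b * x for b, x in zip(row, v)) for row in B]

def matmul_by_columns(B, C):
    """B C built column by column: column j of the product is B times column j of C."""
    columns_of_C = list(zip(*C))
    product_columns = [matvec(B, col) for col in columns_of_C]
    return [list(row) for row in zip(*product_columns)]   # transpose back to rows

B = [[1, 2], [3, 4]]
C = [[0, 1], [1, 0]]
print(matmul_by_columns(B, C))
```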


Determinants 


This section assumes you’ve seen matrices and determinants, but probably didn’t understand the reasons
why they work.

The determinant operation on a matrix produces a scalar. It is the only operation (up to a constant
factor) which is (1) linear in each row and each column of the matrix; and (2) antisymmetric under
exchange of any two rows or any two columns.


The above two rules, linearity and antisymmetry, allow determinants to help solve simultaneous linear 
equations, as we show later under “Cramer’s Rule.” In more detail: 


1. The determinant is linear in each column-vector (and row-vector). This means that multiplying any 
column (or row) by a scalar multiplies the determinant by that scalar. E.g., 


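$$\det\left| k\mathbf{a} \;\; \mathbf{b} \;\; \mathbf{c} \right| = k \det\left| \mathbf{a} \;\; \mathbf{b} \;\; \mathbf{c} \right|; \quad\text{and}\quad \det\left| \mathbf{a} + \mathbf{d} \;\; \mathbf{b} \;\; \mathbf{c} \right| = \det\left| \mathbf{a} \;\; \mathbf{b} \;\; \mathbf{c} \right| + \det\left| \mathbf{d} \;\; \mathbf{b} \;\; \mathbf{c} \right|.$$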


2. The determinant is anti-symmetric with respect to any two column-vectors (or row-vectors). This 
means swapping any two columns (or rows) of the matrix negates its determinant. 


The above properties of determinants imply some others: 


3. Expansion by minors/cofactors (see below), whose derivation proves the determinant operator is 
unique (up to a constant factor). 


4. The determinant of a matrix with any two columns equal (or proportional) is zero. (From anti-
symmetry, swap the two equal columns, the determinant must negate, but its negative now equals
itself. Hence, the determinant must be zero.)

$$\det\left| \mathbf{b} \;\; \mathbf{b} \;\; \mathbf{c} \right| = -\det\left| \mathbf{b} \;\; \mathbf{b} \;\; \mathbf{c} \right| \quad\Rightarrow\quad \det\left| \mathbf{b} \;\; \mathbf{b} \;\; \mathbf{c} \right| = 0.$$

5. det|A| · det|B| = det|AB|. This is crucially important. It also fixes the overall constant factor of the
determinant, so that the determinant (with this property) is a completely unique operator.


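6. Adding a multiple of any column (row) to any other column (row) does not change the determinant:

$$\det\left| \mathbf{a} + k\mathbf{b} \;\; \mathbf{b} \;\; \mathbf{c} \right| = \det\left| \mathbf{a} \;\; \mathbf{b} \;\; \mathbf{c} \right| + \det\left| k\mathbf{b} \;\; \mathbf{b} \;\; \mathbf{c} \right| = \det\left| \mathbf{a} \;\; \mathbf{b} \;\; \mathbf{c} \right| + k \det\left| \mathbf{b} \;\; \mathbf{b} \;\; \mathbf{c} \right| = \det\left| \mathbf{a} \;\; \mathbf{b} \;\; \mathbf{c} \right|.$$

7. det|A + B| ≠ det|A| + det|B|. The determinant operator is not distributive over matrix addition.
8. det|kA| = $k^n$ det|A|.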
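
The listed properties can be spot-checked on small integer matrices. The sketch below uses the standard first-row cofactor expansion as its determinant routine; the helper names and the two sample matrices are illustrative assumptions, and exact integer arithmetic is used so the checks are exact.

```python
def det(M):
    """Determinant by cofactor expansion along the first row (small matrices only)."""
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0
    for j in range(n):
        sub = [row[:j] + row[j + 1:] for row in M[1:]]   # cross out row 0, column j
        total += (-1) ** j * M[0][j] * det(sub)
    return total

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def swap_columns(M, i, j):
    out = [row[:] for row in M]
    for row in out:
        row[i], row[j] = row[j], row[i]
    return out

def add_column_multiple(M, src, dst, k):
    """Return a copy of M with k * (column src) added to column dst."""
    return [[x + k * row[src] if c == dst else x for c, x in enumerate(row)]
            for row in M]

A = [[2, 1, 0], [1, 3, 4], [0, 5, 6]]
B = [[1, 2, 3], [0, 1, 4], [5, 6, 0]]
print(det(A), det(B), det(matmul(A, B)))                 # property 5
print(det(swap_columns(A, 0, 2)))                        # property 2: sign flips
print(det(add_column_multiple(A, 1, 0, 7)))              # property 6: unchanged
print(det([[3 * x for x in row] for row in A]))          # property 8: 3**3 * det(A)
```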


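The ij-th minor, $M_{ij}$, of an n×n matrix ($\mathbf{A} \equiv A_{ab}$) is the product $A_{ij}$ times the determinant of the (n−1)×(n−1)
matrix formed by crossing out the i-th row and j-th column:

[Figure: a matrix with the i-th row and j-th column crossed out.]

A cofactor is just a minor with a plus or minus sign affixed:

$$C_{ij} \equiv (-1)^{i+j} M_{ij} = (-1)^{i+j} A_{ij} \det\left[ \mathbf{A} \text{ without } i\text{-th row and } j\text{-th column} \right].$$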


Cramer’s Rule 


It’s amazing how many textbooks describe Cramer’s rule, and how few explain or derive it. I spent years
looking for this, and finally found it in [Arf ch 3]. Cramer’s rule is a turnkey method for solving simultaneous
linear equations. It is horribly inefficient, and virtually worthless above 3 × 3; however, it does have
important theoretical implications. Cramer’s rule solves for n equations in n unknowns:

Given Ax = b, where A is a coefficient matrix,
x is a vector of unknowns, $x_i$,
b is a vector of constants, $b_i$.

To solve for the i-th unknown $x_i$, we replace the i-th column of A with the constant vector b, take the
determinant, and divide by the determinant of A. Mathematically:


Let A=[a, ja, j + ia, where a, is the i” column of A. We can solve for x; as 


deta, ia Aj | b i Ajype 1 a, 


= where a, is the i” column of A 
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This seems pretty bizarre, and one has to ask, why does this work? It’s quite simple, if we recall the 
properties of determinants. Let’s solve for x1, noting that all other unknowns can be solved analogously. 
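Start by simply multiplying $x_1$ by $\det|\mathbf{A}|$:


$$\begin{aligned}
x_1 \det|\mathbf{A}| &= \det\left|x_1\mathbf{a}_1 \;\vdots\; \mathbf{a}_2 \;\vdots\; \dots \;\vdots\; \mathbf{a}_n\right| && \text{from linearity of det}|\ | \\
&= \det\left|x_1\mathbf{a}_1 + x_2\mathbf{a}_2 \;\vdots\; \mathbf{a}_2 \;\vdots\; \dots \;\vdots\; \mathbf{a}_n\right| && \text{adding a multiple of any column to another doesn't change the determinant} \\
&= \det\left|x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \dots + x_n\mathbf{a}_n \;\vdots\; \mathbf{a}_2 \;\vdots\; \dots \;\vdots\; \mathbf{a}_n\right| && \text{ditto } (n-2) \text{ more times} \\
&= \det\left|\mathbf{A}\mathbf{x} \;\vdots\; \mathbf{a}_2 \;\vdots\; \dots \;\vdots\; \mathbf{a}_n\right| = \det\left|\mathbf{b} \;\vdots\; \mathbf{a}_2 \;\vdots\; \dots \;\vdots\; \mathbf{a}_n\right| && \text{rewriting the first column} \\
\Rightarrow\quad x_1 &= \frac{\det\left|\mathbf{b} \;\vdots\; \mathbf{a}_2 \;\vdots\; \dots \;\vdots\; \mathbf{a}_n\right|}{\det|\mathbf{A}|}.
\end{aligned}$$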
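As a concrete check of the rule, here is a minimal Python sketch. The function names (det, cramer_solve) and the sample 3 × 3 system are illustrative choices, not from the text. The determinant is computed by a simple cofactor expansion, which is adequate only for small matrices.

```python
from fractions import Fraction

def det(m):
    """Determinant by cofactor expansion down the first column (small n only)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0
    for r in range(n):
        # Submatrix with row r and the first column struck out.
        sub = [row[1:] for i, row in enumerate(m) if i != r]
        total += (-1) ** r * m[r][0] * det(sub)
    return total

def cramer_solve(a, b):
    """Solve A x = b by Cramer's rule. a is a list of rows, b a list of constants."""
    d = det(a)
    if d == 0:
        raise ValueError("det A = 0: no unique solution")
    x = []
    for i in range(len(a)):
        # Replace column i of A with the constant vector b.
        a_i = [row[:i] + [b[r]] + row[i + 1:] for r, row in enumerate(a)]
        x.append(Fraction(det(a_i), d))
    return x

# Hypothetical sample system, chosen so the solution is integer-valued.
A = [[2, 1, -1],
     [-3, -1, 2],
     [-2, 1, 2]]
b = [8, -11, -3]
solution = cramer_solve(A, b)
print([str(v) for v in solution])
```

Substituting the result back into the three equations confirms it. For anything beyond a few unknowns, an elimination method is the practical choice.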


Area and Volume as a Determinant 


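[Figure: (left) a parallelogram whose first vector, (a, 0), is horizontal; (right) a general parallelogram.]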


Determining areas of regions defined by vectors is crucial to geometric physics in many areas. It is the 
essence of the Jacobian matrix used in variable transformations of multiple integrals. What is the area of the 
parallelogram defined by two vectors? This is the archetypal area for generalized (oblique, non-normal) 
coordinates. We will proceed in a series of steps, gradually becoming more general. 
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First, consider that the first vector is horizontal (above left). The area is simply base x height: A = ad. 
We can obviously write this as a determinant of the matrix of column vectors, though it is as-yet contrived: 


$$A = \det\begin{bmatrix} a & c \\ 0 & d \end{bmatrix} = ad - (0)c = ad.$$


For a general parallelogram (above right), we can take the big rectangle and subtract the smaller 
rectangles and triangles, by brute force: 


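$$\begin{aligned}
A &= (a+c)(b+d) - 2bc - 2\left(\tfrac{1}{2}\right)cd - 2\left(\tfrac{1}{2}\right)ab = ab + ad + cb + cd - 2bc - cd - ab \\
&= ad - bc = \det\begin{bmatrix} a & c \\ b & d \end{bmatrix}.
\end{aligned}$$
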
This is simple enough in 2-D, but is incomprehensibly complicated in higher dimensions. We can 

achieve the same result more generally, in a way that allows for extension to higher dimensions by induction. 


Start again with the diagram above left, where the first vector is horizontal. We can rotate that to arrive at 
any arbitrary pair of vectors, thus removing the horizontal restriction: 


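Let $\mathbf{R}$ = the rotation matrix. Then the rotated vectors are $\mathbf{R}\begin{bmatrix} a \\ 0 \end{bmatrix}$ and $\mathbf{R}\begin{bmatrix} c \\ d \end{bmatrix}$, and

$$\det\left|\mathbf{R}\begin{bmatrix} a \\ 0 \end{bmatrix} \;\vdots\; \mathbf{R}\begin{bmatrix} c \\ d \end{bmatrix}\right| = \det\left(\mathbf{R}\begin{bmatrix} a & c \\ 0 & d \end{bmatrix}\right) = (\det \mathbf{R}) \det\begin{bmatrix} a & c \\ 0 & d \end{bmatrix} = \det\begin{bmatrix} a & c \\ 0 & d \end{bmatrix}.$$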


The final equality is because rotation matrices are orthogonal, with det = 1. Thus the determinant of 
arbitrary vectors defining arbitrary parallelograms equals the determinant of the vectors spanning the 
parallelogram rotated to have one side horizontal, which equals the area of the parallelogram. 


What about the sign? If we reverse the two vectors, the area comes out negative! That’s ok, because in
differential geometry, 2-D areas are signed: positive if we travel counter-clockwise from the first vector to
the second, and negative if we travel clockwise. The above areas are positive.


In 3-D, the signed volume of the parallelepiped defined by 3 vectors a, b, and c, is the determinant of
the matrix formed by the vectors as columns (positive if abc form a right-handed set, negative if abc are a
left-handed set). We show this with rotation matrices, similar to the 2-D case: First, assume that the
parallelogram defined by bc lies in the x-y plane ($b_z = c_z = 0$). Then the volume is simply (area of the base)
× height:


$$V = (\text{area of base})(\text{height}) = \left(\det\begin{bmatrix} b_x & c_x \\ b_y & c_y \end{bmatrix}\right) a_z = \det\begin{bmatrix} a_x & b_x & c_x \\ a_y & b_y & c_y \\ a_z & 0 & 0 \end{bmatrix},$$


where the last equality is from expansion by cofactors along the bottom row. But now, as before, we 
can rotate such a parallelepiped in 3 dimensions to get any arbitrary parallelepiped. As before, the rotation 
matrix is orthogonal (det = 1), and does not change the determinant of the matrix of column vectors. 


This procedure generalizes to arbitrary dimensions: the signed hyper-volume of a parallelepiped defined
by n vectors in n-D space is the determinant of the matrix of column vectors. The sign is positive if the 3-D
submanifold spanned by each contiguous subset of 3 vectors ($\mathbf{v}_1\mathbf{v}_2\mathbf{v}_3$, $\mathbf{v}_2\mathbf{v}_3\mathbf{v}_4$, $\mathbf{v}_3\mathbf{v}_4\mathbf{v}_5$, ...) is right-handed, and
negated for each subset of 3 vectors that is left-handed.


The Jacobian Determinant and Change of Variables 


How do we change multiple variables in a multiple integral? Given 


4/27/2021 11:49AM — Copyright 2002-2021 Eric L. Michelsen. All rights reserved. 83 of 322 


elmichelsen.physics.ucsd.edu/ | Funky Mathematical Physics Concepts emichels at physics.ucsd.edu 


$$\iiint f(a,b,c)\,da\,db\,dc$$

and the change of variables to u, v, w: $a = a(u,v,w)$, $b = b(u,v,w)$, $c = c(u,v,w)$. The simplistic

$$\iiint f(a,b,c)\,da\,db\,dc \;\to\; \iiint f\left[a(u,v,w),\, b(u,v,w),\, c(u,v,w)\right] du\,dv\,dw \qquad \text{(wrong!)}$$


fails, because the “volume” du dv dw associated with each point of f(·) is different than the volume da
db dc in the original integral.


[Figure: Example of new-coordinate volume element (du dv dw), and its corresponding old-coordinate
volume element (da db dc). The new volume element is a rectangular parallelepiped. The old-coordinate
parallelepiped has sides straight to first order in the original integration variables.]


In the diagram above, we see that the “volume” (du dv dw) is smaller than the old-coordinate “volume” 
(da db dc). Note that “volume” is a relative measure of volume in coordinate space; it has nothing to do 
with a “metric” on the space, and “distance” need not even be defined. 


The integrand is constant (to first order in the integration variables) over the whole volume element. 


Without some correction, the weighting of f(·) throughout the new-coordinate domain is different than
the original integral, and so the integrated sum (i.e., the integral) is different. We correct this by putting in
the original-coordinate differential volume (da db dc) as a function of the new differential coordinates, du,
dv, dw. Of course, this function varies throughout the domain, so we can write


$$\iiint f(a,b,c)\,da\,db\,dc \;\to\; \iiint f\left[a(u,v,w),\, b(u,v,w),\, c(u,v,w)\right] V(u,v,w)\,du\,dv\,dw,$$

where $V(u,v,w)$ takes $(du\,dv\,dw) \to (da\,db\,dc)$.


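To find V(·), consider how the a-b-c space vector $da\,\hat{\mathbf{a}}$ is created from the new u-v-w space. It has
contributions from displacements in all 3 new dimensions, u, v, and w:


$$da\,\hat{\mathbf{a}} = \left(\frac{\partial a}{\partial u}du + \frac{\partial a}{\partial v}dv + \frac{\partial a}{\partial w}dw\right)\hat{\mathbf{a}}. \quad \text{Similarly,}$$

$$db\,\hat{\mathbf{b}} = \left(\frac{\partial b}{\partial u}du + \frac{\partial b}{\partial v}dv + \frac{\partial b}{\partial w}dw\right)\hat{\mathbf{b}},$$

$$dc\,\hat{\mathbf{c}} = \left(\frac{\partial c}{\partial u}du + \frac{\partial c}{\partial v}dv + \frac{\partial c}{\partial w}dw\right)\hat{\mathbf{c}}.$$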


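The volume defined by the 3 vectors $du\,\hat{\mathbf{u}}$, $dv\,\hat{\mathbf{v}}$, and $dw\,\hat{\mathbf{w}}$ maps to the volume spanned by the
corresponding 3 vectors in the original a-b-c space. The a-b-c space volume is given by the determinant of
the components of the vectors da, db, and dc (written as rows below, to match equations above):


$$\text{volume} = \det\begin{bmatrix}
\dfrac{\partial a}{\partial u}du & \dfrac{\partial a}{\partial v}dv & \dfrac{\partial a}{\partial w}dw \\
\dfrac{\partial b}{\partial u}du & \dfrac{\partial b}{\partial v}dv & \dfrac{\partial b}{\partial w}dw \\
\dfrac{\partial c}{\partial u}du & \dfrac{\partial c}{\partial v}dv & \dfrac{\partial c}{\partial w}dw
\end{bmatrix} = \det\begin{bmatrix}
\dfrac{\partial a}{\partial u} & \dfrac{\partial a}{\partial v} & \dfrac{\partial a}{\partial w} \\
\dfrac{\partial b}{\partial u} & \dfrac{\partial b}{\partial v} & \dfrac{\partial b}{\partial w} \\
\dfrac{\partial c}{\partial u} & \dfrac{\partial c}{\partial v} & \dfrac{\partial c}{\partial w}
\end{bmatrix} (du\,dv\,dw),$$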


where the last equality follows from linearity of the determinant. Note that all the partial derivatives are 
functions of u, v, and w. Hence, 


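$$V(u,v,w) = \det\begin{bmatrix}
\dfrac{\partial a}{\partial u} & \dfrac{\partial a}{\partial v} & \dfrac{\partial a}{\partial w} \\
\dfrac{\partial b}{\partial u} & \dfrac{\partial b}{\partial v} & \dfrac{\partial b}{\partial w} \\
\dfrac{\partial c}{\partial u} & \dfrac{\partial c}{\partial v} & \dfrac{\partial c}{\partial w}
\end{bmatrix} \equiv J(u,v,w) \quad [\text{the Jacobian}], \text{ and}$$


$$\iiint f(a,b,c)\,da\,db\,dc \;\to\; \iiint f\left[a(u,v,w),\, b(u,v,w),\, c(u,v,w)\right] J(u,v,w)\,du\,dv\,dw.$$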


QED. 
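A quick numerical illustration of the result, sketched in Python for the 2-D analog (polar coordinates) rather than the 3-D case derived above. The function names, the finite-difference step h, and the grid size n are all assumptions made for this sketch. It integrates f = 1 over the unit disk, once with the Jacobian factor and once with the simplistic (wrong) substitution.

```python
import math

def jacobian_det(transform, u, v, h=1e-6):
    """2-D Jacobian determinant d(a,b)/d(u,v), estimated by central differences."""
    a_up, b_up = transform(u + h, v)
    a_um, b_um = transform(u - h, v)
    a_vp, b_vp = transform(u, v + h)
    a_vm, b_vm = transform(u, v - h)
    da_du = (a_up - a_um) / (2 * h)
    db_du = (b_up - b_um) / (2 * h)
    da_dv = (a_vp - a_vm) / (2 * h)
    db_dv = (b_vp - b_vm) / (2 * h)
    return da_du * db_dv - da_dv * db_du

def polar(r, theta):
    """Old coordinates (a, b) as functions of new coordinates (r, theta)."""
    return r * math.cos(theta), r * math.sin(theta)

def integrate(f, transform, u_range, v_range, n=200, use_jacobian=True):
    """Midpoint-rule integral of f(a, b) over a rectangle in the new coordinates."""
    (u0, u1), (v0, v1) = u_range, v_range
    du, dv = (u1 - u0) / n, (v1 - v0) / n
    total = 0.0
    for i in range(n):
        u = u0 + (i + 0.5) * du
        for j in range(n):
            v = v0 + (j + 0.5) * dv
            weight = jacobian_det(transform, u, v) if use_jacobian else 1.0
            total += f(*transform(u, v)) * weight * du * dv
    return total

def one(a, b):
    return 1.0

disk = ((0.0, 1.0), (0.0, 2 * math.pi))
area_right = integrate(one, polar, *disk)                      # close to pi
area_wrong = integrate(one, polar, *disk, use_jacobian=False)  # close to 2*pi
print(area_right, area_wrong)
```

With the Jacobian (here J = r) the sum approaches the true area π. Without it, every cell of the r-θ grid gets equal weight and the result is 2π.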


Expansion by Cofactors 


Let us construct the determinant operator from its two defining properties: linearity, and antisymmetry.
First, we’ll define a linear operator, then we’ll make it antisymmetric. [This section is optional, though
instructive.]


We first construct an operator which is linear in the first column. For the determinant to be linear in the 
first column, it must be a sum of terms each containing exactly one factor from the first column: 


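$$\text{Let } \mathbf{A} = \begin{bmatrix}
A_{11} & A_{12} & \cdots & A_{1n} \\
A_{21} & A_{22} & \cdots & A_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
A_{n1} & A_{n2} & \cdots & A_{nn}
\end{bmatrix}; \quad \text{then} \quad \det \mathbf{A} = A_{11}(\dots) + A_{21}(\dots) + \dots + A_{n1}(\dots).$$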


To be linear in the first column, the parentheses above must have no factors from the first column (else
they would be quadratic in some components). Now to also be linear in the 2nd column, all of the parentheses
above must be linear in all the remaining columns. Therefore, to fill in the parentheses we need a linear
operator on columns 2...n. But that is the same kind of operator we set out to make: a linear operator on
columns 1...n. Recursion is clearly called for, so the parentheses should be filled in with more
determinants:


$$\det \mathbf{A} = A_{11}(\det \mathbf{M}_1) + A_{21}(\det \mathbf{M}_2) + \dots + A_{n1}(\det \mathbf{M}_n) \qquad \text{(so far)}.$$


We now note that the determinant is linear both in the columns, and in the rows. This means that $\det \mathbf{M}_1$
must not have any factors from the first row or the first column of A. Hence, $\mathbf{M}_1$ must be the submatrix of
A with the first row and first column stricken out.

[Figure: the matrix A with its 1st row and 1st column stricken out, leaving $\mathbf{M}_1$; and with its 2nd row and 1st
column stricken out, leaving $\mathbf{M}_2$; etc.]


Similarly, $\mathbf{M}_2$ must be the submatrix of A with the 2nd row and first column stricken out. And so on,
through $\mathbf{M}_n$, which must be the submatrix of A with the n-th row and first column stricken out. We now have
an operator that is linear in all the rows and columns of A.


So far, this operator is not unique. We could multiply each term in the operator by a constant, and still 
preserve linearity in all rows and columns: 


$$\det \mathbf{A} = k_1 A_{11}(\det \mathbf{M}_1) + k_2 A_{21}(\det \mathbf{M}_2) + \dots + k_n A_{n1}(\det \mathbf{M}_n).$$


We choose these constants to provide the 2nd property of determinants: antisymmetry. The determinant
is antisymmetric on interchange of any two rows. We start by considering swapping the first two rows:
Define $\mathbf{A}' \equiv$ (A with $A_{1\bullet} \leftrightarrow A_{2\bullet}$).


[Figure: the matrix A, and the matrix A′ with rows 1 and 2 swapped, with its 1st row and 1st column stricken
out to leave $\mathbf{M}'_1$, etc.]


Recall that $\mathbf{M}_1$ strikes out the first row, and $\mathbf{M}_2$ strikes out the 2nd row, so swapping row 1 with row 2
replaces the first two terms of the determinant:


$$\det \mathbf{A} = k_1 A_{11}(\det \mathbf{M}_1) + k_2 A_{21}(\det \mathbf{M}_2) + \dots \quad\to\quad \det \mathbf{A}' = k_1 A_{21}(\det \mathbf{M}'_1) + k_2 A_{11}(\det \mathbf{M}'_2) + \dots$$

But $\mathbf{M}'_1 = \mathbf{M}_2$, and $\mathbf{M}'_2 = \mathbf{M}_1$. So we have:

$$\det \mathbf{A}' = k_1 A_{21}(\det \mathbf{M}_2) + k_2 A_{11}(\det \mathbf{M}_1) + \dots$$


This last form is the same as det A, but with $k_1$ and $k_2$ swapped. To make our determinant antisymmetric, we
must choose constants $k_1$ and $k_2$ such that terms 1 and 2 are antisymmetric on interchange of rows 1 and 2.
This simply means that $k_1 = -k_2$. So far, the determinant is unique only up to an arbitrary factor, so we choose
the simplest such constants: $k_1 = 1$, $k_2 = -1$.


For $\mathbf{M}_3$ through $\mathbf{M}_n$, swapping the first two rows of A swaps the first two rows of $\mathbf{M}'_3$ through $\mathbf{M}'_n$.


Since $\mathbf{M}_3$ through $\mathbf{M}_n$ appear inside determinant operators, and such operators are defined to be
antisymmetric on interchange of rows, terms 3 through n also change sign on swapping the first two rows of
A. Thus, all the terms 1 through n change sign on swapping rows 1 and 2, and $\det \mathbf{A} = -\det \mathbf{A}'$.

We are almost done. We have now a unique determinant operator, with $k_1 = 1$, $k_2 = -1$. We must
determine $k_3$ through $k_n$. So consider swapping rows 1 and 3 of A, which must also negate our determinant:


[Figure: the matrix A, and the matrix A″ with rows 1 and 3 swapped, with its 1st row and 1st column stricken
out to leave $\mathbf{M}''_1$, etc.]


Again, $\mathbf{M}''_4$ through $\mathbf{M}''_n$ have rows 1 & 3 swapped, and thus terms 4 through n are negated by their
determinant operators. Also, $\mathbf{M}''_2$ (formed by striking out row 2 of A″) has its rows 1 & 2 swapped, and is
also thus negated.


The terms remaining to be accounted for are $A_{11}(\det \mathbf{M}_1)$ and $k_3 A_{31}(\det \mathbf{M}_3)$. The new $\mathbf{M}''_1$ is the same
as the old $\mathbf{M}_3$, but with its first two rows swapped. Similarly, the new $\mathbf{M}''_3$ is the same as the old $\mathbf{M}_1$, but
with its first two rows swapped. Hence, both terms 1 and 3 are negated by their determinant operators, so
we must choose $k_3 = 1$ to preserve that negation.


Finally, proceeding in this way, we can consider swapping rows 1 & 4, etc. We find that the odd
numbered k’s are all 1, and the even numbered k’s are all −1.


We could also have started from the beginning by linearizing with column 2, and then we find that the k
are opposite to those for column 1: this time for odd numbered rows, $k_{odd} = -1$, and for even numbered rows,
$k_{even} = +1$. The k’s simply alternate sign. This leads to the final form of cofactor expansion about any column
c:


$$\det \mathbf{A} = (-1)^{1+c} A_{1c}(\det \mathbf{M}_{1c}) + (-1)^{2+c} A_{2c}(\det \mathbf{M}_{2c}) + \dots + (-1)^{n+c} A_{nc}(\det \mathbf{M}_{nc}),$$

where $\mathbf{M}_{rc}$ is the submatrix of A with row r and column c stricken out.


Note that: 


We can perform a cofactor expansion down any column, 
or across any row, to compute the determinant of a matrix. 


We usually choose an expansion order which includes as many zeros as possible, to minimize the 
computations needed. 
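The recursion described above can be sketched in a few lines of Python. The function name det_cofactor and the sample matrix are illustrative. Indices are 0-based in the code, which leaves the sign $(-1)^{r+c}$ unchanged because both indices shift by one.

```python
def det_cofactor(m, c=0):
    """Determinant by cofactor expansion down column c (0-based) of a square matrix."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0
    for r in range(n):
        if m[r][c] == 0:
            continue  # zero entries contribute nothing, so skip their sub-determinants
        # Submatrix with row r and column c stricken out.
        sub = [row[:c] + row[c + 1:] for i, row in enumerate(m) if i != r]
        total += (-1) ** (r + c) * m[r][c] * det_cofactor(sub)
    return total

# Hypothetical sample matrix.
A = [[2, 0, 1],
     [1, 3, -1],
     [0, 5, 4]]

# Expanding down any column gives the same value.
values = [det_cofactor(A, c) for c in range(3)]
print(values)

# Swapping two rows negates the determinant (antisymmetry).
A_swapped = [A[1], A[0], A[2]]
print(det_cofactor(A_swapped))
```

Skipping zero entries is the code version of choosing an expansion that includes as many zeros as possible.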


Proof That the Determinant Is Unique 


If we compute the determinant of a matrix two ways, from two different cofactor expansions, do we get 
the same result? Yes. We here prove the determinant is unique by showing that in a cofactor expansion, 
every possible combination of elements from the rows and columns appears exactly once. This is true no 
matter what row or column we expand on. Thus all expansions include the same terms, but just written in a 
different order. 


Also, this complete expansion of all combinations of elements is a useful property of the cofactor
expansion which has many applications beyond determinants. For example, by performing a cofactor
expansion without the alternating signs (in other words, an expansion in minors), we can fully symmetrize a
set of functions (such as boson wave functions).


The proof: let’s count the number of terms in a cofactor expansion of a determinant for an n×n matrix.
We do this by mathematical induction. For the first level of expansion, we choose a row or column, and
construct n terms, where each term includes a cofactor (a sub-determinant of an (n−1) × (n−1) matrix). Thus,
the number of terms in an n×n determinant is n times the number of terms in an (n−1) × (n−1) determinant. Or,
turned around,


$$\#\text{terms in } \big((n+1) \times (n+1)\big) = (n+1)\,\big(\#\text{terms in } n \times n\big).$$
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There is one term in a 1x1 determinant, 2 terms in a 2x2, 6 terms in a 3x3, and thus n! terms in an nxn 
determinant. Each term is unique within the expansion: by construction, no term appears twice as we work 
our way through the cofactor expansion. 


Let’s compare this to the number of terms possible which are linear in every row and column: we have
n choices for the first factor, n−1 choices for the second factor, and so on down to 1 choice for the last factor.
That is, there are n! ways to construct terms linear in all the rows and columns. That is exactly the number
of terms in the cofactor expansion, which means every cofactor expansion is a sum of all possible terms
which are linear in the rows and columns. This proves that the determinant is unique up to a sign.


To prove the sign of the cofactor expansion is also unique, we can consider one specific term in the sum. 
Consider the term which is the product of the main diagonal elements. This term is always positive, since 
TBS ?? 


Getting Determined 


You may have noticed that computing a determinant by cofactor expansion is computationally infeasible
for n > ~15. There are n! terms of n factors each, requiring O(n · n!) operations. For n = 15, this is ~$10^{13}$
operations, which would take about a day on a few GHz computer. For n = 20, it would take years.


Is there a better way? Fortunately, yes. It can be done in O($n^3$) operations, so one can easily compute
the determinant for n = 1000 or more. We do this by using the fact that adding a multiple of any row to
another row does not change the determinant (which follows from anti-symmetry and linearity). Performing
such row operations, we can convert the matrix to upper-right-triangular form, i.e., all the elements of A′
below the main diagonal are zero.


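$$\mathbf{A} = \begin{bmatrix}
A_{11} & A_{12} & \cdots & A_{1n} \\
A_{21} & A_{22} & \cdots & A_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
A_{n1} & A_{n2} & \cdots & A_{nn}
\end{bmatrix} \quad\to\quad \mathbf{A}' = \begin{bmatrix}
A'_{11} & A'_{12} & \cdots & A'_{1,n-1} & A'_{1n} \\
0 & A'_{22} & \cdots & A'_{2,n-1} & A'_{2n} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & A'_{n-1,n-1} & A'_{n-1,n} \\
0 & 0 & \cdots & 0 & A'_{nn}
\end{bmatrix}.$$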


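By construction, det|A′| = det|A|. Using the method of cofactors on A′, we expand down the first column
of A′ and first column of every submatrix in the expansion. E.g., for a 3 × 3 matrix:


$$\det\begin{bmatrix} A'_{11} & A'_{12} & A'_{13} \\ 0 & A'_{22} & A'_{23} \\ 0 & 0 & A'_{33} \end{bmatrix} = A'_{11}\det\begin{bmatrix} A'_{22} & A'_{23} \\ 0 & A'_{33} \end{bmatrix} = A'_{11}A'_{22}A'_{33}.$$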


Only the first term in each expansion survives, because all the others are zero. Hence, det|A′| is the
product of its diagonal elements:


$$\det \mathbf{A} = \det \mathbf{A}' = \prod_{i=1}^{n} A'_{ii}, \quad \text{where } A'_{ii} \text{ are the diagonal elements of } \mathbf{A}'.$$


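Let’s look at the row operations needed to achieve upper-right-triangular form. We multiply the first
row by ($A_{21} / A_{11}$) and subtract it from the 2nd row. This makes the first element of the 2nd row zero (below
left):

$$\begin{bmatrix}
A_{11} & A_{12} & A_{13} & A_{14} \\
0 & B_{22} & B_{23} & B_{24} \\
A_{31} & A_{32} & A_{33} & A_{34} \\
A_{41} & A_{42} & A_{43} & A_{44}
\end{bmatrix} \to \begin{bmatrix}
A_{11} & A_{12} & A_{13} & A_{14} \\
0 & B_{22} & B_{23} & B_{24} \\
0 & B_{32} & B_{33} & B_{34} \\
0 & B_{42} & B_{43} & B_{44}
\end{bmatrix} \to \begin{bmatrix}
A_{11} & A_{12} & A_{13} & A_{14} \\
0 & B_{22} & B_{23} & B_{24} \\
0 & 0 & C_{33} & C_{34} \\
0 & 0 & C_{43} & C_{44}
\end{bmatrix}$$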


Perform this operation for rows 3 through n, and we have made the first column below row 1 all zero (above
middle). Similarly, we can zero the 2nd column below row 2 by multiplying the (new) 2nd row by ($B_{32} / B_{22}$)
and subtracting it from the 3rd row. Perform this again on the 4th row, and we have the first two columns of
the upper-right-triangular form (above right). Iterating for the first (n − 1) columns, we complete the
upper-right-triangular form. The determinant is now the product of the diagonal elements.


About how many operations did that take? There are n(n − 1)/2 row-operations needed, or O($n^2$). Each
row-operation takes from 1 to n multiplies (average n/2), and 1 to n additions (average n/2), summing to O(n)
operations. Total operations is then of order


$$O(n^2)\,O(n) \sim O(n^3).$$
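A minimal Python sketch of this elimination procedure follows. The name det_elimination is illustrative. The text above does not discuss what to do when a diagonal element is zero. As an added assumption, this sketch swaps in a lower row with a nonzero entry and flips the sign, using the antisymmetry property. Exact rational arithmetic (Fraction) is used so the result can be compared exactly.

```python
from fractions import Fraction

def det_elimination(m):
    """Determinant via reduction to upper-triangular form, O(n^3) operations."""
    n = len(m)
    a = [[Fraction(x) for x in row] for row in m]  # work on an exact copy
    sign = 1
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # whole column is zero below the diagonal: det = 0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            sign = -sign  # a row swap negates the determinant
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            # Subtract a multiple of the pivot row; this leaves the determinant unchanged.
            for k in range(col, n):
                a[r][k] -= factor * a[col][k]
    result = Fraction(sign)
    for i in range(n):
        result *= a[i][i]  # product of the diagonal elements
    return result

print(det_elimination([[2, 0, 1], [1, 3, -1], [0, 5, 4]]))
```

For floating-point work one would normally pick the largest available pivot rather than the first nonzero one, to limit round-off error.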


TBS: Proof that det|AB| = det|A| det|B|


Advanced Matrices


Getting to Home Basis 


We often wish to change the basis in which we express vectors and matrix operators, e.g. in quantum 
mechanics. We use a transformation matrix to transform the components of the vectors from the old basis to 
the new basis. Note that: 


We are not transforming the vectors; we are transforming the components of the vector 
from one basis to another. The vector itself is unchanged. 


There are two ways to visualize the transformation. In the first method, we write the decomposition of 
a vector into components in matrix form. We use the visualization from above that a matrix times a vector 
is a weighted sum of the columns of the matrix: 


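$$\mathbf{v} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} v^x \\ v^y \\ v^z \end{bmatrix} = \begin{bmatrix} v^x \\ v^y \\ v^z \end{bmatrix} \quad \text{where} \quad \mathbf{e}_x = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \; \mathbf{e}_y = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \; \mathbf{e}_z = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}.$$


If we wish to convert to the $\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3$ basis, we simply write $\mathbf{e}_x, \mathbf{e}_y, \mathbf{e}_z$ in the 1-2-3 basis:


$$\mathbf{v} = \begin{bmatrix} a & d & g \\ b & e & h \\ c & f & i \end{bmatrix}\begin{bmatrix} v^x \\ v^y \\ v^z \end{bmatrix} = \begin{bmatrix} v^1 \\ v^2 \\ v^3 \end{bmatrix} \quad \text{where (in the 1-2-3 basis):} \quad \mathbf{e}_x = \begin{bmatrix} a \\ b \\ c \end{bmatrix}, \; \mathbf{e}_y = \begin{bmatrix} d \\ e \\ f \end{bmatrix}, \; \mathbf{e}_z = \begin{bmatrix} g \\ h \\ i \end{bmatrix}.$$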


Thus: 
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The columns of the transformation matrix are the old basis vectors written in the new basis. 


This is true even for non-ortho-normal bases. 


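Now let us look at the same transformation matrix, from the viewpoint of its rows. For this, we must
restrict ourselves to ortho-normal bases. This is usually not much of a restriction. Recall that the component
of a vector v in the direction of a basis vector $\mathbf{e}_i$ is given by:


$$v^i = \mathbf{e}_i \cdot \mathbf{v} \quad\Rightarrow\quad \mathbf{v} = (\mathbf{e}_x \cdot \mathbf{v})\mathbf{e}_x + (\mathbf{e}_y \cdot \mathbf{v})\mathbf{e}_y + (\mathbf{e}_z \cdot \mathbf{v})\mathbf{e}_z.$$

But this is a vector equation, valid in any basis. So i above could also be 1, 2, or 3 for the new basis:

$$v^1 = \mathbf{e}_1 \cdot \mathbf{v}, \quad v^2 = \mathbf{e}_2 \cdot \mathbf{v}, \quad v^3 = \mathbf{e}_3 \cdot \mathbf{v} \quad\Rightarrow\quad \mathbf{v} = (\mathbf{e}_1 \cdot \mathbf{v})\mathbf{e}_1 + (\mathbf{e}_2 \cdot \mathbf{v})\mathbf{e}_2 + (\mathbf{e}_3 \cdot \mathbf{v})\mathbf{e}_3.$$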


Recall from the section above on matrix multiplication that multiplying a matrix by a vector is equivalent 
to making a set of dot products, one from each row, with the vector: 


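$$\begin{bmatrix} \mathbf{e}_1 \\ \mathbf{e}_2 \\ \mathbf{e}_3 \end{bmatrix}\mathbf{v} = \begin{bmatrix} \mathbf{e}_1 \cdot \mathbf{v} \\ \mathbf{e}_2 \cdot \mathbf{v} \\ \mathbf{e}_3 \cdot \mathbf{v} \end{bmatrix} = \begin{bmatrix} v^1 \\ v^2 \\ v^3 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} (\mathbf{e}_1)_x & (\mathbf{e}_1)_y & (\mathbf{e}_1)_z \\ (\mathbf{e}_2)_x & (\mathbf{e}_2)_y & (\mathbf{e}_2)_z \\ (\mathbf{e}_3)_x & (\mathbf{e}_3)_y & (\mathbf{e}_3)_z \end{bmatrix}\begin{bmatrix} v^x \\ v^y \\ v^z \end{bmatrix} = \begin{bmatrix} \mathbf{e}_1 \cdot \mathbf{v} \\ \mathbf{e}_2 \cdot \mathbf{v} \\ \mathbf{e}_3 \cdot \mathbf{v} \end{bmatrix} = \begin{bmatrix} v^1 \\ v^2 \\ v^3 \end{bmatrix}.$$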


Thus: 


The rows of the transformation matrix are the new basis vectors written in the old basis. 
This is only true for ortho-normal bases. 


There is a beguiling symmetry, and nonsymmetry, in the above two boxed statements about the columns 
and rows of the transformation matrix. 


For complex vectors, we must use the dot product defined with the conjugate of the row basis vector, 
i.e. the rows of the transformation matrix are the hermitian adjoints of the new basis vectors written in the 
old basis: 


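$$\begin{bmatrix} \mathbf{e}_1^\dagger \\ \mathbf{e}_2^\dagger \\ \mathbf{e}_3^\dagger \end{bmatrix}\mathbf{v} = \begin{bmatrix} \mathbf{e}_1 \cdot \mathbf{v} \\ \mathbf{e}_2 \cdot \mathbf{v} \\ \mathbf{e}_3 \cdot \mathbf{v} \end{bmatrix} = \begin{bmatrix} v^1 \\ v^2 \\ v^3 \end{bmatrix}.$$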
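Both statements about the transformation matrix (rows and columns) can be checked numerically. The following Python sketch uses a real, ortho-normal 2-D basis rotated by 30 degrees from the old one. The angle, the sample vector, and all names are assumptions for illustration.

```python
import math

def dot(a, b):
    """Real dot product of two component lists."""
    return sum(x * y for x, y in zip(a, b))

theta = math.radians(30)
# New (ortho-normal) basis vectors, written in the old x-y basis.
e1 = [math.cos(theta), math.sin(theta)]
e2 = [-math.sin(theta), math.cos(theta)]

# Rows of the transformation matrix are the new basis vectors in the old basis.
U = [e1, e2]

def to_new_basis(v):
    """Components of v in the new basis: one dot product per row of U."""
    return [dot(row, v) for row in U]

v_old = [3.0, 4.0]
v_new = to_new_basis(v_old)

# The vector itself is unchanged: rebuilding it from the new components
# and the new basis vectors recovers the old components.
rebuilt = [v_new[0] * e1[k] + v_new[1] * e2[k] for k in range(2)]

# Columns of U are the old basis vectors written in the new basis.
col_x = [U[0][0], U[1][0]]
col_y = [U[0][1], U[1][1]]
print(v_new, rebuilt)
```

For complex vectors, dot would need to conjugate its first argument, per the hermitian-adjoint rule above.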


Diagonalizing a Self-Adjoint Matrix 


A special case of basis changing comes up often in quantum mechanics: we wish to change to the basis 
of eigenvectors of a given operator. In this basis, the basis vectors (which are also eigenvectors) always have 
the form of a single ‘1’ component, and the rest 0. E.g., 


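$$\mathbf{e}_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad \mathbf{e}_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad \mathbf{e}_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}.$$


The matrix operator A, in this basis (its own eigenbasis), is diagonal, because:

$$\begin{aligned} \mathbf{A}\mathbf{e}_1 &= \lambda_1\mathbf{e}_1 \\ \mathbf{A}\mathbf{e}_2 &= \lambda_2\mathbf{e}_2 \\ \mathbf{A}\mathbf{e}_3 &= \lambda_3\mathbf{e}_3 \end{aligned} \quad\Rightarrow\quad \mathbf{A} = \begin{bmatrix} \lambda_1 & & \\ & \lambda_2 & \\ & & \lambda_3 \end{bmatrix}.$$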


Finding the unitary (i.e., unit magnitude) transformation from a given basis to the eigenbasis of an 
operator is called diagonalizing the matrix. We saw above that the transformation matrix from one basis to 
another is just the hermitian adjoint of the new basis vectors written in the old basis. We call this matrix U: 


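$$\mathbf{U}\mathbf{v} = \begin{bmatrix} \mathbf{e}_1^\dagger \\ \mathbf{e}_2^\dagger \\ \mathbf{e}_3^\dagger \end{bmatrix}\mathbf{v} = \begin{bmatrix} \mathbf{e}_1 \cdot \mathbf{v} \\ \mathbf{e}_2 \cdot \mathbf{v} \\ \mathbf{e}_3 \cdot \mathbf{v} \end{bmatrix} = \begin{bmatrix} v^1 \\ v^2 \\ v^3 \end{bmatrix} \quad\Rightarrow\quad \mathbf{U} = \begin{bmatrix} \mathbf{e}_1^\dagger \\ \mathbf{e}_2^\dagger \\ \mathbf{e}_3^\dagger \end{bmatrix}.$$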


U transforms vectors, but how do we transform the operator matrix A itself? The simplest way to see 
this is to note that we can perform the operation A in any basis by transforming the vector back to the original 
basis, using A in the original basis, and then transforming the result to the new basis: 

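$$\mathbf{v}_{new} = \mathbf{U}\mathbf{v}_{old} \quad\Rightarrow\quad \mathbf{v}_{old} = \mathbf{U}^{-1}\mathbf{v}_{new},$$

$$\mathbf{A}_{new}\mathbf{v}_{new} = \mathbf{U}\left(\mathbf{A}_{old}\mathbf{v}_{old}\right) = \mathbf{U}\left(\mathbf{A}_{old}\mathbf{U}^{-1}\mathbf{v}_{new}\right) = \left(\mathbf{U}\mathbf{A}_{old}\mathbf{U}^{-1}\right)\mathbf{v}_{new} \quad\Rightarrow\quad \mathbf{A}_{new} = \mathbf{U}\mathbf{A}_{old}\mathbf{U}^{-1},$$


where we used the fact that matrix multiplication is associative. Thus an operator matrix transforms to the
new basis as $\mathbf{A}_{new} = \mathbf{U}\mathbf{A}_{old}\mathbf{U}^{-1}$.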


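We can see this another way by starting with:

$$\mathbf{A}\mathbf{U}^{-1} = \mathbf{A}\left[\mathbf{e}_1 \;\vdots\; \mathbf{e}_2 \;\vdots\; \mathbf{e}_3\right] = \left[\mathbf{A}\mathbf{e}_1 \;\vdots\; \mathbf{A}\mathbf{e}_2 \;\vdots\; \mathbf{A}\mathbf{e}_3\right] = \left[\lambda_1\mathbf{e}_1 \;\vdots\; \lambda_2\mathbf{e}_2 \;\vdots\; \lambda_3\mathbf{e}_3\right],$$


where $\mathbf{e}_i$ are the ortho-normal eigenvectors, and $\lambda_i$ are the eigenvalues.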

Recall the eigenvectors (of self-adjoint matrices) are orthogonal, so we can now pre-multiply by the 
hermitian conjugate of the eigenvector matrix: 


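$$\mathbf{U}\mathbf{A}\mathbf{U}^{-1} = \begin{bmatrix} \mathbf{e}_1^\dagger \\ \mathbf{e}_2^\dagger \\ \mathbf{e}_3^\dagger \end{bmatrix}\left[\lambda_1\mathbf{e}_1 \;\vdots\; \lambda_2\mathbf{e}_2 \;\vdots\; \lambda_3\mathbf{e}_3\right] = \begin{bmatrix}
\lambda_1(\mathbf{e}_1 \cdot \mathbf{e}_1) & \lambda_2(\mathbf{e}_1 \cdot \mathbf{e}_2) & \lambda_3(\mathbf{e}_1 \cdot \mathbf{e}_3) \\
\lambda_1(\mathbf{e}_2 \cdot \mathbf{e}_1) & \lambda_2(\mathbf{e}_2 \cdot \mathbf{e}_2) & \lambda_3(\mathbf{e}_2 \cdot \mathbf{e}_3) \\
\lambda_1(\mathbf{e}_3 \cdot \mathbf{e}_1) & \lambda_2(\mathbf{e}_3 \cdot \mathbf{e}_2) & \lambda_3(\mathbf{e}_3 \cdot \mathbf{e}_3)
\end{bmatrix} = \begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix},$$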

where the final equality is because each element of the result is the inner product of two eigenvectors,
weighted by an eigenvalue. The only non-zero inner products are between the same eigenvectors
(orthogonality), so only diagonal elements are non-zero. Since the eigenvectors are normalized, their inner
product is 1, leaving only the weight (i.e., the eigenvalue) as the result.


Warning: Some references write the diagonalization as $\mathbf{U}^{-1}\mathbf{A}\mathbf{U}$, instead of the correct $\mathbf{U}\mathbf{A}\mathbf{U}^{-1}$. This is
confusing, and inconsistent with vector transformation. Many of these very references then
change their notation when they have to transform a vector, because nearly all references
agree that vectors transform with $\mathbf{U}$, and not $\mathbf{U}^{-1}$.
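To make the convention concrete, here is a small Python sketch for a real symmetric 2 × 2 matrix whose eigenvectors are known in closed form. The matrix, its eigenvectors, and the helper names are illustrative assumptions. With real entries the hermitian adjoint reduces to the transpose, so the inverse of U is just its transpose.

```python
import math

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

# Sample self-adjoint (real symmetric) matrix.
A = [[2.0, 1.0],
     [1.0, 2.0]]

# Its ortho-normal eigenvectors, written in the old basis.
s = 1 / math.sqrt(2)
e1 = [s, s]    # eigenvalue 3
e2 = [s, -s]   # eigenvalue 1

# Rows of U are the (adjoints of the) eigenvectors; U is orthogonal,
# so its inverse is its transpose.
U = [e1, e2]
U_inv = transpose(U)

A_new = matmul(matmul(U, A), U_inv)  # U A U^-1
identity = matmul(U, U_inv)
print(A_new)
```

Up to round-off, A_new is diagonal with the eigenvalues 3 and 1 on the diagonal, in the same order as the rows of U.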


Contraction of Matrices 


You don’t see a dot product of matrices defined very often, but the concept comes up in physics, even if 
they don’t call it a “dot product.” We see such products in QM density matrices, and in tensor operations on 
vectors. We use it below in the “Trace” section for traces of products. 


For two matrices of the same size, we define the contraction of two matrices as the sum of the products 
of the corresponding elements (much like the dot product of two vectors). The contraction is a scalar. Picture 
the contraction as overlaying one matrix on top of the other, multiplying the stacked numbers (elements), and 
adding all the products: 


sum = A:B 


We use a colon to convey that the summation is over 2 dimensions (rows and columns) of A and B 
(whereas the single-dot dot product of vectors sums over the 1 dimensional list of vector components): 


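$$\mathbf{A}:\mathbf{B} \equiv \sum_{i,j} a_{ij}b_{ij}. \quad \text{For example, for 3×3 matrices:}$$


$$\begin{aligned}
\mathbf{A}:\mathbf{B} = {} & a_{11}b_{11} + a_{12}b_{12} + a_{13}b_{13} \\
{} + {} & a_{21}b_{21} + a_{22}b_{22} + a_{23}b_{23} \\
{} + {} & a_{31}b_{31} + a_{32}b_{32} + a_{33}b_{33},
\end{aligned}$$

which is a single number.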


If the matrices are complex, we do not conjugate the left matrix (such conjugation is often done in 
defining the dot product of complex vectors). 


Trace of a Product of Matrices 


The trace of a matrix is defined as the sum of the diagonal elements: 


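$$\mathrm{Tr}(\mathbf{A}) \equiv \sum_{j=1}^{n} a_{jj}. \quad \text{E.g.:} \quad \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}, \quad \mathrm{Tr}(\mathbf{A}) = a_{11} + a_{22} + a_{33}.$$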


The trace of a product of matrices comes up often, e.g. in quantum field theory. We first show that Tr(AB)
= Tr(BA):

Let $\mathbf{C} = \mathbf{A}\mathbf{B}$ $\Rightarrow$ $\mathrm{Tr}(\mathbf{A}\mathbf{B}) = c_{11} + c_{22} + \dots + c_{nn}$.
Define $\mathbf{a}_{r\bullet}$ as the r-th row of A, and $\mathbf{b}_{\bullet c}$ as the c-th column of B. Then, for 3×3 matrices:

$$c_{11} = \mathbf{a}_{1\bullet} \cdot \mathbf{b}_{\bullet 1} = a_{11}b_{11} + a_{12}b_{21} + a_{13}b_{31} \quad \text{(row 1 of A overlaid on column 1 of B, i.e., on row 1 of } \mathbf{B}^T\text{)},$$

$$c_{22} = \mathbf{a}_{2\bullet} \cdot \mathbf{b}_{\bullet 2} = a_{21}b_{12} + a_{22}b_{22} + a_{23}b_{32} \quad \text{(row 2 of A overlaid on column 2 of B, i.e., on row 2 of } \mathbf{B}^T\text{)},$$

and so on.


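The diagonal elements of the product C are the sums of the overlays of the rows of A on the columns of
B. But this is the same as the overlays of the rows of A on the rows of $\mathbf{B}^T$. Then we sum the overlays, i.e.,
we overlay A onto $\mathbf{B}^T$, and sum all the products of all the overlaid elements:


$$\mathrm{Tr}(\mathbf{A}\mathbf{B}) = \mathbf{A}:\mathbf{B}^T.$$


Now consider $\mathrm{Tr}(\mathbf{B}\mathbf{A}) = \mathbf{B}:\mathbf{A}^T$. But visually, $\mathbf{B}:\mathbf{A}^T$ overlays the same pairs of elements as $\mathbf{A}:\mathbf{B}^T$, but
in the transposed order. When we sum over all the products of the pairs, we get the same sum either way:


$$\mathrm{Tr}(\mathbf{A}\mathbf{B}) = \mathrm{Tr}(\mathbf{B}\mathbf{A}) \quad \text{because} \quad \mathbf{A}:\mathbf{B}^T = \mathbf{B}:\mathbf{A}^T.$$

This leads to the important cyclic property for the trace of the product of several matrices:


$$\mathrm{Tr}(\mathbf{A}\mathbf{B}\dots\mathbf{C}) = \mathrm{Tr}(\mathbf{C}\mathbf{A}\mathbf{B}\dots) \quad \text{because} \quad \mathrm{Tr}\big((\mathbf{A}\mathbf{B}\dots)\mathbf{C}\big) = \mathrm{Tr}\big(\mathbf{C}(\mathbf{A}\mathbf{B}\dots)\big),$$


and matrix multiplication is associative. By simple induction, any cyclic rotation of the matrices leaves
the trace unchanged.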
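These identities are easy to spot-check with small integer matrices. In the Python sketch below, the helper names (contract, trace, and so on) and the sample matrices are illustrative choices.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def trace(m):
    """Sum of the diagonal elements."""
    return sum(m[i][i] for i in range(len(m)))

def contract(a, b):
    """Contraction A:B, the sum of products of corresponding elements."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

A = [[1, 2], [3, 4]]
B = [[0, 5], [6, 7]]
C = [[1, 1], [0, 2]]

tr_ab = trace(matmul(A, B))
tr_ba = trace(matmul(B, A))
overlay = contract(A, transpose(B))          # A : B^T

# Cyclic rotations of a three-matrix product.
tr_abc = trace(matmul(matmul(A, B), C))
tr_cab = trace(matmul(matmul(C, A), B))
tr_bca = trace(matmul(matmul(B, C), A))
print(tr_ab, tr_ba, overlay, tr_abc, tr_cab, tr_bca)
```

Note that only cyclic rotations are covered by the argument above; a non-cyclic reordering such as Tr(ACB) is in general a different number.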


Linear Algebra Briefs 


The determinant equals the product of the eigenvalues:


$$\det \mathbf{A} = \prod_{i=1}^{n} \lambda_i, \quad \text{where } \lambda_i \text{ are the eigenvalues of } \mathbf{A}.$$


This is because both the eigenvalues and the determinant are unchanged through a similarity transformation
(det|UAU⁻¹| = det|U| det|A| det|U⁻¹| = det|A|). If we diagonalize the matrix, the main diagonal consists of
the eigenvalues, and the determinant of a diagonal matrix is the product of the diagonal elements (by cofactor
expansion).
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7 Introduction to Probability, Statistics, and Data 
Analysis 


I think probability and statistics are among the most conceptually difficult topics in mathematical 
physics. We start with a brief overview of the basics, but overall, we assume you are familiar with simple 
probabilities, and gaussian distributions. 


Probability and Random Variables 


We assume you have a basic idea of probability, and since we seek here understanding over mathematical 
purity, we give here intuitive definitions. A random variable, say X, is a quantity that you can observe (or 
measure), multiple times (at least in principle), and is not completely predictable. Each observation 
(instance) of a random variable may give a different value. Random variables may be discrete (the roll of a 
die), or continuous (the angle of a game spinner after you spin it). A uniform random variable has all its 
values equally likely. Thus the roll of a (fair) die is a uniform discrete random variable. The angle of a game 
spinner is a uniform continuous random variable. But in general, the values of a random variable are not 
necessarily equally likely. For example, a gaussian (aka “normal”) random variable is more likely to be near 
the mean. 


Given a large sample of observations of any physical quantity X, there will be some structure to the
values X assumes. For discrete random variables, each possible value will appear (close to) some fixed
fraction of the time in any large sample. The fraction of a large sample that a given value appears is that
value’s probability. For a 6-sided die, the probability of rolling 1 is 1/6, i.e. Pr(1) = 1/6. Because probability
is a fraction of a total, it is always dimensionless, and between 0 and 1 inclusive:


0 ≤ Pr(anything) ≤ 1.


[Note that one can imagine systems of chance specifically constructed to not provide consistency between
samples, at least not on realistic time scales. By definition, then, observations of such a system do not constitute a
random variable in the sense of our definition.]


Strictly speaking, a statistic is a number that summarizes in some way a set of random values. Many 
people use the word informally, though, to mean the raw data from which we compute true statistics. 


Conditional Probability 


Conditional probability specifically addresses probability when the state of the system is partly known.
A priori probability generally implies less knowledge of state (“a priori” means “in the beginning” or
“beforehand”). But there is no true, fundamental distinction, because all probabilities are in some way
dependent on both physics and knowledge.


Suppose you have one bag with 2 white and 2 black balls. You draw 2 balls without replacement. What
is the chance the 2nd ball will be white? A priori, it’s obviously 1/2. However, suppose the first ball is known
white. Now Pr(2nd ball is white) = 1/3. So we say the conditional probability that the 2nd ball will be white,
given that the first ball is white, is 1/3. In symbols:


Pr(2nd ball white | first ball white) =1/3. 
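Both numbers can be confirmed by brute-force enumeration of the equally likely ordered draws. This Python sketch uses illustrative names; the balls are labeled by index so that the two white balls count as distinct outcomes.

```python
from fractions import Fraction
from itertools import permutations

bag = ["W", "W", "B", "B"]

# All equally likely ordered draws of 2 distinct balls (by index) from the bag.
draws = [(bag[i], bag[j]) for i, j in permutations(range(len(bag)), 2)]

# A priori: fraction of all draws in which the 2nd ball is white.
second_white = [d for d in draws if d[1] == "W"]
pr_second_white = Fraction(len(second_white), len(draws))

# Conditional: restrict the population to draws whose 1st ball is white,
# then take the fraction of those in which the 2nd ball is also white.
first_white = [d for d in draws if d[0] == "W"]
both_white = [d for d in first_white if d[1] == "W"]
pr_second_given_first = Fraction(len(both_white), len(first_white))

print(pr_second_white, pr_second_given_first)
```

Conditioning is just this: discard the outcomes inconsistent with what is known, and renormalize over what remains.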


Another example of how conditional probability of an event can be different than the a priori probability
of that event: I have a bag of white and a bag of black balls. I give you a bag at random. What is the chance
the 2nd ball will be white? A priori, it’s 1/2. After seeing the 1st ball is white, now Pr(2nd ball is white) = 1.
In this case,


Pr(2nd ball white | first ball white) =1. 
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Precise Statement of the Question Is Critical 


Many arguments arise about probability because the questions are imprecise: each combatant has a 
different interpretation of the question, but neither realizes the other is arguing a different issue. Consider 
this: 


You deal 4 cards from a shuffled standard deck of 52 cards. I tell you 3 of them are aces. What is the
probability that the 4th card is also an ace?


The question is ambiguous, and could reasonably be interpreted two ways, but the two interpretations 
have quite different answers. It is very important to know exactly how I have discovered that 3 of them are 
aces. 


Case 1: I look at the 4 cards and say “At least 3 of these cards are aces.” There are 193 ways that 4
cards can hold at least 3 aces (derived below), and only 1 of those ways has 4 aces. Therefore, the chance of
the 4th card being an ace is 1/193.


Case 2: I look at only 3 of the 4 cards and say, “These 3 cards are aces.” There are 49 unseen cards, all
equally likely to be the 4th card. Only one of them is an ace. Therefore, the chance of the 4th card being an
ace is 1/49.


It may help to show that we can calculate the 1/49 chance from the 193 hands that have at least 3 aces:
Of the 192 that have exactly 3 aces, we expect that 1/4 of them = 48 will show aces as their first 3 cards
(because the non-ace has probability 1/4 of being last). Additionally, the one hand of 4 aces will always
show aces as its first 3 cards. Hence, of the 193 hands with at least 3 aces, 49 show aces as their first 3 cards,
of which exactly 1 will be the 4-ace hand. Hence, its conditional probability, given that the first 3 cards are
aces, is 1/49.


Here is an ad-hoc derivation that there are 193 sets of 4 cards with at least 3 aces: First, how many ways
(combinations) are there to have exactly 3 aces in 4 cards? The # ways to have a set of 3 aces = # ways to
omit 1 ace = 4. Then there are 48 non-ace cards left for the 4th card. Thus there are 4 × 48 = 192 ways
(combinations) to have exactly 3 aces in 4 cards. There is 1 combination of 4 aces in 4 cards. Thus there are
192 + 1 = 193 ways (combinations) to have at least 3 aces in 4 cards.
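The counts in both cases can be verified by brute force. The Python sketch below uses an illustrative encoding of the deck (rank 0 stands for the ace). It enumerates all 4-card hands for Case 1, and all orderings of the qualifying hands for Case 2.

```python
from fractions import Fraction
from itertools import combinations, permutations

# Deck as (rank, suit) pairs; rank 0 is taken to be the ace.
deck = [(rank, suit) for rank in range(13) for suit in range(4)]

def is_ace(card):
    return card[0] == 0

def ace_count(cards):
    return sum(1 for c in cards if is_ace(c))

# Case 1: unordered hands with at least 3 aces.
hands = [h for h in combinations(deck, 4) if ace_count(h) >= 3]
four_ace_hands = [h for h in hands if ace_count(h) == 4]
case1 = Fraction(len(four_ace_hands), len(hands))

# Case 2: ordered deals whose first 3 cards are aces. Only hands with at
# least 3 aces can qualify, so it suffices to order those hands.
first_three_aces = 0
fourth_also_ace = 0
for hand in hands:
    for deal in permutations(hand):
        if ace_count(deal[:3]) == 3:
            first_three_aces += 1
            if is_ace(deal[3]):
                fourth_also_ace += 1
case2 = Fraction(fourth_also_ace, first_three_aces)

print(len(hands), case1, case2)
```

The two answers differ because the two statements select different populations: all hands with at least 3 aces, versus all deals whose first 3 cards are aces.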


Let’s Make a Deal 


This is an example of a problem that confuses many people (including me), and how to properly analyze 
it. We hope this example illustrates some general methods of analysis that you can use to navigate more 
general confusing questions. In particular, the methods used here apply to renormalizing entangled quantum 
states when a measurement of one value is made. 


You are in the Big Deal on the game show Let’s Make a Deal. There are 3 doors. Hidden behind two 
of them are goats; behind the other is the Big Prize. You choose door #1. Monty Hall, the MC, knows what’s 
behind each door. He opens door #2, and shows you a goat. Now he asks, do you want to stick with your 
door choice, or switch to door #3? Should you switch? 


Answer: Without loss of generality (WLOG), we assume you choose door #1 (and of course, it doesn’t 
matter which door you choose). You know ahead of time that no matter what, Monty will show you a goat 
behind one of the doors you didn’t pick. For simple questions like this, you can make an exhaustive chart of 
all the mutually exclusive events, and their probabilities: 


Doors 1-2-3 (B = Big Prize, g = goat)    Monty shows    Probability 
Bgg                                      door #2        1/6 
Bgg                                      door #3        1/6 
gBg                                      door #3        1/3 
ggB                                      door #2        1/3 


After you choose, Monty shows you that door #2 is a goat. So from the population of possibilities, we strike 
out those that are no longer possible (i.e., where he shows door #3, and those where the big prize is #2), and 
renormalize the remaining probabilities: 


Doors 1-2-3    Monty shows    Original    Renormalized 
Bgg            door #2        1/6         1/3 
Bgg            door #3        1/6         (struck out) 
gBg            door #3        1/3         (struck out) 
ggB            door #2        1/3         2/3 
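

The strike-out-and-renormalize procedure is mechanical enough to sketch in code. This is an illustrative sketch only: the function names are made up here, and it assumes (as the chart does) that you picked door #1 and that Monty opens #2 or #3 with equal chance when both hide goats.

```python
from fractions import Fraction

def monty_table():
    """Rows of (prize_door, door_monty_shows, probability), given you picked door #1."""
    third = Fraction(1, 3)
    return [
        (1, 2, third / 2),  # Bgg: Monty may open #2 or #3, assumed equally likely
        (1, 3, third / 2),
        (2, 3, third),      # gBg: Monty must open #3
        (3, 2, third),      # ggB: Monty must open #2
    ]

def posterior_given_shown(shown):
    """Strike out rows inconsistent with the door shown, then renormalize."""
    rows = [(prize, p) for prize, door, p in monty_table() if door == shown]
    total = sum(p for _, p in rows)
    return {prize: p / total for prize, p in rows}

print(posterior_given_shown(2))  # {1: Fraction(1, 3), 3: Fraction(2, 3)}
```

Using exact fractions instead of a random simulation keeps the result free of sampling noise.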


Another way to think of this: Monty showing you door #2 is equivalent to saying, “The big prize is 
either the door you picked, or it’s door #3.” Since your chance of having picked right (1/3) is unaffected by 
Monty telling you this, Pr(big prize is #3) = 2/3. Monty uses his knowledge to always pick a door with a 
goat. That gives you information, which improves your ability to guess right on your second guess. 


You can also see it this way: There’s a 1/3 chance you picked right the first time. If you switch, you’ll 
lose. But there’s a 2/3 chance you picked wrong the first time. If you switch, you’ll win. So by switching, 
you win twice as often as you lose, much better odds than 1/3 of winning. 


Let’s take a more extreme example: suppose there are 100 doors, and you pick #1. Now Monty tells 
you, “The big prize is either the door you picked, or it’s door #57.” Should you switch? Of course. The 
chance you guessed right is tiny, but Monty knows for sure. 


How to Lie With Statistics 


In 2007, on the front page of newspapers, was a story about a big study of sexual behavior in America. 
The headline point was that on average, heterosexual men have 7 partners in their lives, and women have 
only 4. 


Innumeracy, a book about math and statistics, uses this exact same claim from a previous study of sexual 
behavior, and notes that one can easily prove that the average number of heterosexual partners of men and 
women must be exactly the same (if there are equal numbers of men and women in the population; the US 
has equal numbers of men and women to better than 1%). 


The only explanation for the survey results is that many people are lying. Typically, men lie on the high 
side, women lie on the low side. The article goes on to quote all kinds of statistics and “facts,” oblivious to 
the fact that these claims are based on lies. So how much can you believe anything the study subjects said? 


Even more amazing to me is that the “scientists” doing the study seem equally oblivious to the 
mathematical impossibility of their results. (Perhaps some graduate student got a PhD out of this study, too.) 


The proof: every heterosexual encounter involves a man and a woman. If the partners are new to each 
other, then it counts as a new partner for both the man and the woman. The average number of partners for 
men is the total number of new partners for all men divided by the number of men in the US. But this is 
equal to the total number of new partners for all women divided by the number of women in the US. QED. 


[An insightful friend noted, “Maybe to some women, some guys aren’t worth counting.”] 


Choosing Wisely: An Informative Puzzle 


Here’s a puzzle which illuminates the physical meaning of the $\binom{n}{k}$ binomial forms. Try it yourself 
before reading the answer. Really. First, recall that: 


$$\binom{n}{k} \equiv n \text{ choose } k \equiv \frac{n!}{k!\,(n-k)!}$$ 


is the number of combinations of k items taken from n distinct items; more precisely, $\binom{n}{k}$ is the number of 
ways of choosing k items from n distinct items, without replacement, where the order of choosing doesn’t 
matter. 


The puzzle: Show in words, without algebra, that $\binom{n+1}{k} = \binom{n}{k-1} + \binom{n}{k}$. 


Some purists may complain that the demonstration below lacks rigor (not true), or that the algebraic 
demonstration is “shorter.” However, though the algebraic proof is straightforward, it is dull and 
uninformative. Some may like the demonstration here because it uses the physical meaning of the 
mathematics to reach an iron-clad conclusion. 


The solution: The LHS is the number of ways of choosing k items from n + 1 distinct items. Now there 
are two distinct subsets of those ways: those ways that include the (n + 1)th item, and those that don’t. In the 
first subset, given the (n + 1)th item, we must choose k − 1 more items from the remaining n, and there are 
$\binom{n}{k-1}$ ways to do this. In the second subset, we must choose all k items from the first n, and there are 
$\binom{n}{k}$ ways to do this. Since this covers all the possible ways to choose k items from n + 1 items, it must be that 


$$\binom{n+1}{k} = \binom{n}{k-1} + \binom{n}{k} . \qquad \text{QED}$$ 
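

The verbal argument can be mirrored directly in code: list every way of choosing k items from n + 1, and sort each choice by whether it contains the last item. This is a small sketch with an illustrative function name (`split_choices`) and an arbitrary choice of n = 6, k = 3.

```python
from itertools import combinations
from math import comb

def split_choices(n, k):
    """Split all k-subsets of items 1..n+1 by whether they contain item n+1."""
    last = n + 1
    with_last = without_last = 0
    for subset in combinations(range(1, n + 2), k):
        if last in subset:
            with_last += 1      # the other k-1 items come from the first n
        else:
            without_last += 1   # all k items come from the first n
    return with_last, without_last

n, k = 6, 3
with_last, without_last = split_choices(n, k)
print(with_last, comb(n, k - 1))                 # 15 15
print(without_last, comb(n, k))                  # 20 20
print(with_last + without_last, comb(n + 1, k))  # 35 35
```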


Multiple Events 


First we summarize the rules for computing the probability of combinations of independent events from 
their individual probabilities, then we justify them: 


Pr(A and B) = Pr(A)·Pr(B),                    A and B independent 

Pr(A or B) = Pr(A) + Pr(B),                   A and B mutually exclusive 

Pr(not A) = 1 − Pr(A),                        always 

Pr(A or B) = Pr(A) + Pr(B) − Pr(A)Pr(B),      A and B independent. 


For independent events A and B, Pr(A and B) = Pr(A)·Pr(B). This follows from the definition of 
probability as a fraction. If A and B are independent (have nothing to do with each other), then Pr(A) is the 
fraction of trials with event A. Then, of those trials with event A, the fraction that also has B is 
Pr(B). Therefore, the fraction of the total trials with both A and B is: 


Pr(A and B) = Pr(A)·Pr(B). 


For mutually exclusive events, Pr(A or B) = Pr(A) + Pr(B). This also follows from the definition of 
probability as a fraction. The fraction of trials with event A = Pr(A); fraction with event B = Pr(B). If no 
trial can contain both A and B, then the fraction with either is simply the sum (figure below). 


[Figure: the bar of total trials is divided into the fraction with A and the fraction with B, which do not 
overlap; together they make up the fraction with A or B.] 


Pr(not A) = 1 — Pr(A). Since Pr(A) is the fraction of trials with event A, and all trials must either have 
event A or not: 


Pr(A) + Pr(not A) = 1. 


Notice that A and (not A) are mutually exclusive events (a trial can’t both have A and not have A), so their 
probabilities add. 


By Pr(A or B) we mean Pr(A or B or both). For independent events, you might think that Pr(A or B) = 
Pr(A) + Pr(B), but this is not so. A simple example shows that it can’t be: suppose Pr(A) = Pr(B) = 0.7. Then 
Pr(A) + Pr(B) = 1.4, which can’t be the probability of anything. The reason for the failure of simple addition 
of probabilities is that doing so counts the probability of (A and B) twice (figure below): 


[Figure: the bar of total trials is divided into the fraction with A only, the fraction with A and B, and the 
fraction with B only; together these three make up the fraction with A or B.] 


Note that Pr(A or B) is equivalent to Pr(A and maybe B) or Pr(B and maybe A). But Pr(A and maybe B) 
includes the probability of both A and B, as does Pr(B and maybe A), hence it is counted twice. So subtracting 
the probability of (A and B) makes it counted only once: 


Pr(A or B) = Pr(A) + Pr(B) − Pr(A)Pr(B),      A and B independent. 


A more complete statement, which breaks down (A or B) into mutually exclusive events, is: 


Pr(A or B) = Pr(A and not B) + Pr(not A and B) + Pr(A and B). 


Since the right-hand side now consists of mutually exclusive events, their probabilities add: 


Pr(A or B) = Pr(A)[1 − Pr(B)] + Pr(B)[1 − Pr(A)] + Pr(A)Pr(B) 
           = Pr(A) + Pr(B) − 2Pr(A)Pr(B) + Pr(A)Pr(B) 
           = Pr(A) + Pr(B) − Pr(A)Pr(B). 
TBS: Example of rolling 2 dice. 
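

As one possible illustration with two dice (an example chosen here, not the author's), take A = "first die shows 6" and B = "second die shows 6," which are independent. The sketch below enumerates all 36 rolls; the helper `pr` and the event names are assumptions for illustration.

```python
from fractions import Fraction
from itertools import product

ROLLS = list(product(range(1, 7), repeat=2))  # 36 equally likely (die1, die2) outcomes

def pr(event):
    """Probability of an event given as a predicate on a (die1, die2) roll."""
    return Fraction(sum(1 for roll in ROLLS if event(roll)), len(ROLLS))

def a(roll):
    return roll[0] == 6   # event A: first die shows 6

def b(roll):
    return roll[1] == 6   # event B: second die shows 6

pr_a_or_b = pr(lambda r: a(r) or b(r))
print(pr(a) + pr(b))                  # 1/3: simple addition double-counts the (6, 6) roll
print(pr(a) + pr(b) - pr(a) * pr(b))  # 11/36
print(pr_a_or_b)                      # 11/36, by direct count
```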


Combining Probabilities 


Here is a more in-depth view of multiple events, with several examples. This section should be called 
“Probability Calculus,” but most people associate “calculus” with something hard, and I didn’t want to scare 
them off. In fact, calculus simply means “a method of calculation.” 


Boolean algebra is the mathematics of expressions and variables that can have one of only two values: 
usually taken to be “true” and “false.” We will use only a few simple, intuitive aspects of Boolean algebra 
here. 


An event is something that can either happen, or not (it’s binary!). We define the probability of an 
event as the fraction of time, out of many (possibly hypothetical) trials, that the given event happens. For 
example, the probability of getting a “heads” from a toss of a fair coin is 0.5, which we might write as 
Pr(heads) = 0.5 = 1/2. Probability is a fraction of a whole, and so lies in [0, 1]. 


We now consider two random events. Two events have one of 3 relationships: independent, mutually 
exclusive, or conditional (aka conditionally dependent). We will soon see that the first two are special cases 
of the “conditional” relationship. We now consider each relationship, in turn. 


Independent: For now, we define independent events as events that have nothing to do with each other, 
and no effect on each other. For example, consider two events: tossing a heads, and rolling a 1 on a 6-sided 
die. Then Pr(heads) = 1/2, and Pr(rolling 1) = 1/6. The events are independent, since the coin cannot 
influence the die, and the die cannot influence the coin. We define one “trial” as two actions: a toss and a 
roll. Since probabilities are fractions, of all trials, 1/2 will have “heads”, and 1/6 of those will roll a 1. 
Therefore, 1/12 of all trials will contain both a “heads” and a 1. We see that probabilities of independent 
events multiply. We write: 


Pr(A and B) = Pr(A)Pr(B) (independent events). 


In fact, this is the precise definition of independence: if the probability of two events both occurring is the 
product of the individual probabilities, then the events are independent. 


[Aside: This definition extends to PDFs: if the joint PDF of two random variables is the product of their 
individual PDFs, then the random variables are independent.] 
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Geometric diagrams are very helpful in understanding the probability calculus. We can picture the 
probabilities of A, B, and (A and B) as areas. The sample space or population is the set of all possible 
outcomes of trials. We draw that as a rectangle. Each point in the rectangle represents one possible outcome. 
Therefore, the probability of an outcome being within a region of the population is proportional to the area 
of the region. 


Figure 7.1 (a): An event A either happens, or it doesn’t. Therefore: 


Pr(A) + Pr(~A) = 1 (always) . 

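[Figure 7.1 panels: the sample space (aka population) drawn as a unit square with axes running from 0 to 1; 
(a) A and not A, with Pr(A) marked on the horizontal axis; (b) independent; (c) conditional; (d) mutually exclusive.] 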


Figure 7.1 The (continuous) sample space is the square. Areas are proportional to probabilities. (a) 
An event either happens, or it doesn’t. (b) Events A and B are independent. (c) A and B are 
dependent. (d) A and B are mutually exclusive. 


Figure 7.1 (b): Pr(A) is the same whether B occurs or not, as shown by the fact that the fraction of B covered 
by A is the same as the fraction of the sample space covered by A. Therefore, by definition, A and B are independent. 


Figure 7.1 (c): The probability of (A or B (or both)) is the red, blue, and magenta areas. Geometrically 
then, we see: 


Pr(A or B) = Pr(A) + Pr(B) − Pr(A and B) (always). 


This is always true, regardless of any dependence between A and B. 


Conditionally dependent: From the diagram, when A and B are conditionally dependent, we see that 
the Pr(B) depends on whether A happens or not. Pr(B given that A occurred) is written as Pr(B | A), and read 
as “probability of B given A.” From the ratio of the magenta area to the red, we see 


Pr(B | A) = Pr(B and A)/Pr(A) . (always). 


Mutually exclusive: Two events are mutually exclusive when they cannot both happen (Figure 7.1d). 
Thus, 


Pr(A and B) = 0, and Pr(A or B) = Pr(A) + Pr(B) (mutually exclusive). 


Note that Pr(A or B) follows the rule from above, which always applies. 


We see that independent events are an extreme case of conditional events: independent events satisfy: 


Pr(B | A) = Pr(B) (independent), 


since the occurrence of A has no effect on B. Also, mutually exclusive events satisfy: 


Pr(B | A) = 0 (mutually exclusive). 
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Summary of Probability Calculus 


Always true: 

Pr(~A) = 1 − Pr(A)                               Pr(entire sample space) = 1, Figure 7.1a. 
Pr(A or B) = Pr(A) + Pr(B) − Pr(A and B)         Subtract off any double-count of “A and B,” Figure 7.1c. 


A and B independent: 

Pr(A and B) = Pr(A)Pr(B)                         Precise def’n of “independent.” 
Pr(A or B) = Pr(A) + Pr(B) − Pr(A)Pr(B)          Using the “and” and “or” rules above. 


A and B mutually exclusive: 

Pr(A and B) = 0                                  Def’n of “mutually exclusive.” 
Pr(A or B) = Pr(A) + Pr(B)                       Pr(A or B) from above. 


Conditional probabilities: 

Pr(B and A) = Pr(B | A)Pr(A) = Pr(A | B)Pr(B)    Bayes’ Rule: Shows relationship between Pr(B | A) and Pr(A | B). 
Pr(A or B) = Pr(A) + Pr(B) − Pr(A and B)         Same as “Always true” rule above. 


Note that the “and” rules are often simpler than the “or” rules. 


To B, or To Not B? 


Sometimes it’s easier to compute Pr(~A) than Pr(A). Then we can find Pr(A) from Pr(A) = 1 − Pr(~A). 


Example: What is the probability of rolling 4 or more with two dice? 


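The population has 36 possibilities. To compute the probability directly, we use: 


$$\underbrace{3}_{\text{ways to roll 4}} + \underbrace{4}_{\text{ways to roll 5}} + 5 + 6 + 5 + 4 + 3 + \underbrace{2}_{\text{ways to roll 11}} + \underbrace{1}_{\text{ways to roll 12}} = 33 \quad\Rightarrow\quad \Pr(\ge 4) = \frac{33}{36} .$$ 


That’s a lot of addition. It’s much easier to note that: 


$$\underbrace{1}_{\text{ways to roll 2}} + \underbrace{2}_{\text{ways to roll 3}} = 3 \quad\Rightarrow\quad \Pr(<4) = \frac{3}{36}, \quad\text{and}\quad \Pr(\ge 4) = 1 - \Pr(<4) = \frac{33}{36} .$$ 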
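

Both routes can be checked by tabulating the two-dice totals. A brief sketch follows; the variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Number of ways to roll each total with two six-sided dice.
ways = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))

direct = Fraction(sum(ways[total] for total in range(4, 13)), 36)  # nine terms
via_complement = 1 - Fraction(ways[2] + ways[3], 36)               # two terms
print(direct, via_complement)  # 11/12 11/12, i.e. 33/36 in lowest terms
```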


In particular, the “and” rules are often simpler than the “or” rule. Therefore, when asked for the 
probability of “this or that”, it is sometimes simpler to convert to its complementary “and” statement, 
compute the “and” probability, and subtract it from 1 to find the “or” probability. 


Example: From a standard 52-card deck, draw a single card. What is the chance it is a spade or a face-card 
(or both)? Note that these events are independent. 


To compute directly, we use the “or” rule: 


$$\Pr(\text{spade}) = 1/4, \qquad \Pr(\text{facecard}) = 3/13,$$ 

$$\Pr(\text{spade or facecard}) = \frac{1}{4} + \frac{3}{13} - \frac{1}{4}\cdot\frac{3}{13} = \frac{13 + 12 - 3}{52} = \frac{22}{52} .$$ 


Because the two events are independent, it may be simpler to compute the probability of drawing neither a 
spade nor a face-card, and subtracting from 1: 


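$$\Pr(\sim\text{spade}) = 3/4, \qquad \Pr(\sim\text{facecard}) = 10/13,$$ 

$$\Pr(\text{spade or facecard}) = 1 - \Pr(\sim\text{spade and} \sim\text{facecard}) = 1 - \frac{3}{4}\cdot\frac{10}{13} = 1 - \frac{30}{52} = \frac{22}{52} .$$ 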


The benefit of converting to the simpler “and” rule increases with more “or” terms, as shown in the next 
example. 


Example: Remove the 12 face cards from a standard 52-card deck, leaving 40 number cards (aces are 
1). Draw a single card. What is the chance it is a spade (S), low (L) (4 or less), or odd (O)? Note that these 
3 events are independent. 


To compute directly, we can count up the number of ways the conditions can be met, and divide by the 
population of 40 cards. There are 10 spades, 16 low cards, and 20 odd numbers. But we can’t just sum those 
numbers, because we would double (and triple) count many of the cards. 


To compute directly, we must extend the “or” rule to 3 conditions, shown below. 




Figure 7.2 Venn diagram for Spade, Low, and Odd. 


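Without proof, we state that the direct computation from a 3-term “or” rule is this: 


$$\Pr(S) = 1/4, \qquad \Pr(L) = 4/10, \qquad \Pr(O) = 1/2$$ 

$$\begin{aligned}
\Pr(S \text{ or } L \text{ or } O) &= \Pr(S) + \Pr(L) + \Pr(O) - \Pr(S)\Pr(L) - \Pr(S)\Pr(O) - \Pr(L)\Pr(O) + \Pr(S)\Pr(L)\Pr(O) \\
&= \frac{1}{4} + \frac{4}{10} + \frac{1}{2} - \left[\left(\frac{1}{4}\cdot\frac{4}{10}\right) + \left(\frac{1}{4}\cdot\frac{1}{2}\right) + \left(\frac{4}{10}\cdot\frac{1}{2}\right)\right] + \left(\frac{1}{4}\cdot\frac{4}{10}\cdot\frac{1}{2}\right) \\
&= \frac{10 + 16 + 20 - 4 - 5 - 8 + 2}{40} = \frac{31}{40} .
\end{aligned}$$ 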


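It is far easier to compute the chance that it is none of these (neither spade, nor low, nor odd): 


$$\Pr(\sim S) = 3/4, \qquad \Pr(\sim L) = 6/10, \qquad \Pr(\sim O) = 1/2$$ 

$$\Pr(S \text{ or } L \text{ or } O) = 1 - \Pr(\sim S \text{ and} \sim L \text{ and} \sim O) = 1 - \Pr(\sim S)\Pr(\sim L)\Pr(\sim O) = 1 - \frac{3}{4}\cdot\frac{6}{10}\cdot\frac{1}{2} = 1 - \frac{9}{40} = \frac{31}{40} .$$ 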
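

Counting the 40-card deck directly gives the same 31/40 and confirms that the complement route is safe here. This is an illustrative sketch: the deck representation and the predicate names (`spade`, `low`, `odd`) are assumptions, not anything defined in the text.

```python
from fractions import Fraction
from itertools import product

SUITS = ["spade", "heart", "diamond", "club"]
DECK = list(product(SUITS, range(1, 11)))  # 40 number cards, ace counted as 1

def pr(event):
    """Probability that a single drawn card satisfies the predicate."""
    return Fraction(sum(1 for card in DECK if event(card)), len(DECK))

def spade(card):
    return card[0] == "spade"

def low(card):
    return card[1] <= 4

def odd(card):
    return card[1] % 2 == 1

direct = pr(lambda c: spade(c) or low(c) or odd(c))
via_complement = 1 - (pr(lambda c: not spade(c))
                      * pr(lambda c: not low(c))
                      * pr(lambda c: not odd(c)))
print(direct, via_complement)  # 31/40 31/40
```

The product of the three "not" probabilities is valid only because the three events are independent in this deck; for dependent events one would have to count the "none of these" cards directly.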


You may have noticed that converting “S or L or O” into “~(~S and ~L and ~O)” is an example of De 
Morgan’s theorem from Boolean algebra. 
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Continuous Random Variables and Distributions 


Probability is a little more complicated for continuous random variables. A continuous population is a 
set of random values that can take on values in a continuous interval of real numbers; for example, if I spin 
a board-game spinner, the little arrow can point in any direction: 0 ≤ θ < 2π. 




Figure 7.3 Board game spinner 


Furthermore, all angles are equally likely. By inspection, we see that the probability of being in the first 
quadrant is 1/4, i.e. Pr(0 < θ < π/2) = 1/4. Similarly, the probability of being in any interval dθ is: 


$$\Pr(\theta \text{ in any interval } d\theta) = \frac{1}{2\pi}\, d\theta .$$ 


If I ask, “what is the chance that it will land at exactly θ = π?” the probability goes to zero, because the 
interval dθ goes to zero. In this simple example, the probability of being in any interval dθ is the same as 
being in any other interval of the same size. In general, however, some systems have a probability per unit 
interval that varies with the value of the random variable (call it X) (I wish I had a simple, everyday example 
of this??). So: 


$$\Pr(X \text{ in an infinitesimal interval } dx \text{ around } x) = \mathrm{pdf}_X(x)\, dx, \quad\text{where}$$ 

$$\mathrm{pdf}_X(x) \equiv \text{the probability distribution function of } X .$$ 


pdf_X(x) has units of [X]/[x]. 


E.g., if X has units of meters ([X] = m), and x also has units of meters ([x] = m), then pdf_X(x) is dimensionless. 
If [X] = dimensionless and [t] = s, then pdf_X(t) has units of 1/s. 


By summing mutually exclusive probabilities, the probability of X in any finite interval [a, b] is: 


$$\Pr(a \le X \le b) = \int_a^b dx\; \mathrm{pdf}_X(x) .$$ 


Since any random variable X must have some real value, the total probability of X being between −∞ and 
+∞ must be 1: 


$$\Pr(-\infty < X < \infty) = \int_{-\infty}^{\infty} dx\; \mathrm{pdf}_X(x) = 1 .$$ 
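

For the board-game spinner, these integrals can be approximated numerically. The following is a minimal sketch using a midpoint-rule sum; `spinner_pdf` and `prob_between` are illustrative names, and the integration range of −10 to 10 is an arbitrary stand-in for "all real values," chosen only because it contains the whole interval where the pdf is nonzero.

```python
import math

def spinner_pdf(theta):
    """Uniform pdf on [0, 2*pi): every direction is equally likely."""
    return 1.0 / (2.0 * math.pi) if 0.0 <= theta < 2.0 * math.pi else 0.0

def prob_between(pdf, a, b, steps=10_000):
    """Midpoint-rule estimate of the integral of pdf from a to b."""
    dx = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * dx) for i in range(steps)) * dx

first_quadrant = prob_between(spinner_pdf, 0.0, math.pi / 2)
everything = prob_between(spinner_pdf, -10.0, 10.0)
print(first_quadrant)  # approximately 0.25
print(everything)      # approximately 1.0
```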


The probability distribution function of a random variable tells you 
everything there is to know about that random variable. 


Populations 


A population is a (often infinite) set of all possible values that a random variable may take on, along 
with their probabilities. A sample is a finite set of values of a random variable, where those values come 
from the population of all possible values. The same value may be repeated in a sample. We often use 
samples to estimate the characteristics of a much larger population. 


A trial or instance is one value of a random variable. 
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There is enormous confusion over the binomial (and similar) distributions, because each instance of a 
binomial random variable comes from many attempts at an event, where each attempt is labeled either 
“success” or “failure.” Superficially, an “attempt” looks like a “trial,” and many sources confuse the terms. 
In the binomial distribution, n attempts go into making a single trial (or instance) of a binomial random 
variable. 


Population Variance 


The variance of a population is a measure of the “spread” of any distribution, i.e. it is some measure of 
how widely spread out values of a random variable are likely to be [there are other measures of spread, too]. 
The variance of a population or sample is among the most important parameters in statistics. Variance is 
always ≥ 0, and is defined as the average squared-difference between the random values and their average 
value: 


$$\mathrm{var}(X) \equiv \left\langle (X - \bar{X})^2 \right\rangle \quad\text{where}\quad \langle\;\rangle \text{ is an operator which takes the average, and } \bar{X} \equiv \langle X \rangle .$$ 


Note that: 

Whenever we write an operator such as var(X), we can think of it as a functional of the PDF of X 
(recall that a functional acts on a function to produce a number). 


$$\mathrm{var}(X) = \mathrm{var}[\mathrm{pdf}_X(x)] = \int_{-\infty}^{\infty} (x - \bar{X})^2\, \mathrm{pdf}_X(x)\, dx \equiv \left\langle (X - \bar{X})^2 \right\rangle .$$ 


The units of variance are the square of the units of X. From the definition, we see that if I multiply a set of 
random numbers by a constant k, then I multiply their variance by k²: 


$$\mathrm{var}(kX) = k^2\, \mathrm{var}(X) \quad\text{where } X \text{ is any set of random numbers} .$$ 


Any function, including variance, with the above property is homogeneous-of-order-2 (2nd-order 
homogeneous??). We will return later to methods of estimating the variance of a population. 
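

The k² scaling is easy to confirm on a small finite population. This sketch uses Python's standard `statistics.pvariance` (population variance, 1/N weighting) with exact fractions; the four values and k = 3 are arbitrary choices made for illustration.

```python
from fractions import Fraction
from statistics import pvariance

values = [Fraction(v) for v in (1, 2, 4, 7)]  # an arbitrary illustrative population
k = 3

var_x = pvariance(values)                   # average squared deviation from the mean
var_kx = pvariance([k * v for v in values])
print(var_x, var_kx, var_kx == k**2 * var_x)
```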


Population Standard Deviation 


The standard deviation of a population is another measure of the “spread” of a distribution, defined 
simply as the square root of the variance. Standard deviation is always ≥ 0, and equals the root-mean-square 
(RMS) of the deviations from the average: 


$$\mathrm{dev}(X) \equiv \sqrt{\mathrm{var}(X)} = \sqrt{\left\langle (X - \bar{X})^2 \right\rangle} \quad\text{where}\quad \langle\;\rangle \text{ is an operator which takes the average} .$$ 


As with variance, we can think of dev(X) as a functional acting on pdf_X(x): dev[pdf_X(x)]. The units of standard 
deviation are the units of X. From the definition, we see that if I multiply a set of random numbers by a 
constant k, then I multiply their standard deviation by |k| (the magnitude, since deviation is never negative): 


$$\mathrm{dev}(kX) = |k|\, \mathrm{dev}(X) \quad\text{where } X \text{ is any set of random numbers} .$$ 


Standard deviation and variance are useful measures, even for non-normal populations. 


They have many universal properties, some of which we discuss as we go. There exist bounds on the 
percentage of any population contained within ±cσ, for any number c. Even stronger bounds apply for all 
unimodal populations. 


Normal (aka Gaussian) Distribution 


From mathworld.wolfram.com/NormalDistribution.html : “While statisticians and mathematicians 
uniformly use the term ‘normal distribution’ for this distribution, physicists sometimes call it a gaussian 
distribution and, because of its curved flaring shape, social scientists refer to it as the ‘bell curve.’ ” 
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A gaussian distribution is one of a 2-parameter family of distributions defined as a population with: 


$$\mathrm{pdf}(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \quad\text{where}\quad \mu \equiv \text{population average}, \quad \sigma \equiv \text{population standard deviation} \qquad \text{[picture??]} .$$ 


μ and σ are parameters: μ can be any real value, and σ > 0 and real. This illustrates a common feature of 
named distributions: they are usually a family of distributions, parameterized by one or more parameters. A 
gaussian distribution is a 2-parameter distribution: μ and σ. As noted below: 


Any linear combination of gaussian random variables is another gaussian random variable. 


Gaussian distributions are the only such distributions [ref??]. 


New Random Variables From Old Ones 


Given two random variables X and Y, we can construct new random variables as functions of x and y 
(trial values of X and Y). One common such new random variable is simply the sum: 


$$\text{Define } Z \equiv X + Y, \quad\text{which means}\quad \forall \text{ trials } i, \;\; z_i = x_i + y_i .$$ 


We then ask, given pdfx(x) and pdfy(y) (which is all we can know about X and Y), what is pdfz(z)? To answer 
this, consider a particular value x of X; we see that: 


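$$\text{Given } x: \quad \Pr(Z \text{ within } dz \text{ of } z) = \Pr(Y \text{ within } dz \text{ of } (z - x)) .$$ 

But x is a value of a random variable, so the total Pr(Z within dz of z) is the sum (integral) over all x: 

$$\Pr(Z \text{ within } dz \text{ of } z) = \int_{-\infty}^{\infty} dx\; \mathrm{pdf}_X(x)\, \Pr(Y \text{ within } dz \text{ of } (z - x)), \quad\text{but}$$ 

$$\Pr(Y \text{ within } dz \text{ of } (z - x)) = \mathrm{pdf}_Y(z - x)\, dz, \quad\text{so}$$ 

$$\Pr(Z \text{ within } dz \text{ of } z) = dz \int_{-\infty}^{\infty} dx\; \mathrm{pdf}_X(x)\, \mathrm{pdf}_Y(z - x)$$ 

$$\Rightarrow\quad \mathrm{pdf}_Z(z) = \int_{-\infty}^{\infty} dx\; \mathrm{pdf}_X(x)\, \mathrm{pdf}_Y(z - x) .$$ 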


This integral way of combining two functions, pdfx(x) and pdfy(y) with a parameter z is called the 
convolution of pdfx and pdfy, which is a function of a number, z. 
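

To make the convolution integral concrete, here is a rough numerical sketch. It assumes, purely for illustration, that X and Y are independent and each uniformly distributed on [0, 1); the integral is approximated by a midpoint-rule sum, and the function names and integration range are made up for this example. For that choice the exact result is the triangle pdf_Z(z) = z for 0 ≤ z ≤ 1 and 2 − z for 1 ≤ z ≤ 2.

```python
def uniform_pdf(x):
    """pdf of a random variable uniformly distributed on [0, 1)."""
    return 1.0 if 0.0 <= x < 1.0 else 0.0

def convolve_at(pdf_x, pdf_y, z, lo=-1.0, hi=3.0, steps=4000):
    """Midpoint-rule estimate of the integral over x of pdf_x(x) * pdf_y(z - x)."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += pdf_x(x) * pdf_y(z - x)
    return total * dx

for z in (0.5, 1.0, 1.5):
    # Compare with the exact triangle values 0.5, 1.0, 0.5.
    print(z, round(convolve_at(uniform_pdf, uniform_pdf, z), 2))
```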


[Figure: pdf_X(x), pdf_Y(y), and the convolution of pdf_X with pdf_Y, illustrated at z = 3.] 


The convolution evaluated at z is the area under the product pdfx(x)pdfy(z — x). 


From the above, we can easily deduce pdf_Z(z) if Z ≡ X − Y = X + (−Y). First, we find pdf_(−Y)(y), and 
then use the convolution rule. Note that: 


4/27/2021 11:49AM — Copyright 2002-2021 Eric L. Michelsen. All rights reserved. 104 of 322 


elmichelsen.physics.ucsd.edu/ | Funky Mathematical Physics Concepts emichels at physics.ucsd.edu 


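$$\mathrm{pdf}_{(-Y)}(y) = \mathrm{pdf}_Y(-y)$$ 

$$\Rightarrow\quad \mathrm{pdf}_Z(z) = \int_{-\infty}^{\infty} dx\; \mathrm{pdf}_X(x)\, \mathrm{pdf}_{(-Y)}(z - x) = \int_{-\infty}^{\infty} dx\; \mathrm{pdf}_X(x)\, \mathrm{pdf}_Y(x - z) .$$ 

Since we are integrating from −∞ to +∞, we can shift x with no effect: 

$$x \to x + z \quad\Rightarrow\quad \mathrm{pdf}_Z(z) = \int_{-\infty}^{\infty} dx\; \mathrm{pdf}_X(x + z)\, \mathrm{pdf}_Y(x),$$ 

which is the standard form for the correlation function of two functions, pdf_X(x) and pdf_Y(y). 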


[Figure: pdf_X(x), pdf_Y(y), and the correlation of pdf_X with pdf_Y, illustrated at z = 2.] 


The correlation function evaluated at z is the area under the product pdfx(x + z)pdfy(x). 


Note that the convolution of a gaussian distribution with a different gaussian is another gaussian. Therefore, 
the sum of a gaussian random variable with any other independent gaussian random variable is gaussian. 


Some Distributions Have Infinite Variance, or Infinite Average 


In principle, the only requirement on a PDF is that it be normalized: 


$$\int_{-\infty}^{\infty} \mathrm{pdf}(x)\, dx = 1 .$$ 


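Such a distribution has well-defined probabilities for all x. However, even given that, it is possible that the 
variance is infinite (or properly, undefined). For example, consider: 


$$\mathrm{pdf}(x) = \begin{cases} 2x^{-3} & x \ge 1 \\ 0 & x < 1 \end{cases} \qquad \langle x \rangle = \int_1^{\infty} x\, \mathrm{pdf}(x)\, dx = 2, \quad\text{but}\quad \sigma^2 = \int_1^{\infty} x^2\, \mathrm{pdf}(x)\, dx \to \infty .$$ 


The above distribution is normalized, and has finite average, but infinite deviation. The following example 
is even worse: 


$$\mathrm{pdf}(x) = \begin{cases} x^{-2} & x \ge 1 \\ 0 & x < 1 \end{cases} \qquad \langle x \rangle = \int_1^{\infty} x\, \mathrm{pdf}(x)\, dx \to \infty, \quad\text{and}\quad \sigma^2 = \int_1^{\infty} x^2\, \mathrm{pdf}(x)\, dx \to \infty .$$ 


This distribution is normalized, but has both infinite average and infinite deviation. 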
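

For readers who want the intermediate steps, the integrals for both examples can be worked out explicitly. In the sketch below, Λ is an upper cutoff introduced here only to show how the divergent integrals grow; it is not part of the original text.

```latex
% First example: pdf(x) = 2 x^{-3} for x >= 1
\int_1^{\infty} 2x^{-3}\,dx = \left[-x^{-2}\right]_1^{\infty} = 1, \qquad
\int_1^{\infty} x \cdot 2x^{-3}\,dx = \left[-2x^{-1}\right]_1^{\infty} = 2, \qquad
\int_1^{\Lambda} x^2 \cdot 2x^{-3}\,dx = 2\ln\Lambda \;\to\; \infty \ \text{ as } \Lambda \to \infty .

% Second example: pdf(x) = x^{-2} for x >= 1
\int_1^{\infty} x^{-2}\,dx = \left[-x^{-1}\right]_1^{\infty} = 1, \qquad
\int_1^{\Lambda} x \cdot x^{-2}\,dx = \ln\Lambda \;\to\; \infty, \qquad
\int_1^{\Lambda} x^2 \cdot x^{-2}\,dx = \Lambda - 1 \;\to\; \infty .
```

In the first example the second moment diverges only logarithmically, which is why the distribution can still have a perfectly finite average.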


Are such distributions physically meaningful? Sometimes. The Lorentzian (aka Breit-Wigner) 
distribution is common in physics, or at least, a good approximation to physical phenomena. It has infinite 
average and deviation. Its standard and parameterized forms are: 


$$L(x) = \frac{1}{\pi}\,\frac{1}{1 + x^2}, \qquad L(x; x_0, \gamma) = \frac{1}{\pi\gamma}\,\frac{1}{1 + \left((x - x_0)/\gamma\right)^2}$$ 

where x_0 ≡ location of peak, γ ≡ half-width at half-maximum. 


This is approximately the energy distribution of particles created in high-energy collisions. Its CDF is: 


$$\mathrm{cdf}_{\text{Lorentzian}}(x) = \frac{1}{\pi}\arctan\!\left(\frac{x - x_0}{\gamma}\right) + \frac{1}{2} .$$ 
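

A quick numerical sanity check of the parameterized form and its CDF is sketched below. The function names and the parameter values x0 = 2, γ = 0.5 are arbitrary choices for illustration.

```python
import math

def lorentzian_pdf(x, x0=0.0, gamma=1.0):
    """Parameterized Lorentzian: peak at x0, half-width at half-maximum gamma."""
    return 1.0 / (math.pi * gamma * (1.0 + ((x - x0) / gamma) ** 2))

def lorentzian_cdf(x, x0=0.0, gamma=1.0):
    """Cumulative distribution of the Lorentzian."""
    return math.atan((x - x0) / gamma) / math.pi + 0.5

x0, gamma = 2.0, 0.5  # arbitrary illustrative parameters

# The pdf at x0 + gamma is half its peak value (hence "half-width at half-maximum").
half_max_ratio = lorentzian_pdf(x0 + gamma, x0, gamma) / lorentzian_pdf(x0, x0, gamma)

# Half of all the probability lies within one half-width of the peak.
within_one_gamma = lorentzian_cdf(x0 + gamma, x0, gamma) - lorentzian_cdf(x0 - gamma, x0, gamma)

print(half_max_ratio)    # 0.5
print(within_one_gamma)  # approximately 0.5
```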


Samples and Parameter Estimation 


Why Do We Use Least Squares, and Least Chi-Squared (χ²)? 


We frequently use “least sum-squared-residuals” (aka least squares) as our definition of “best.” Why 
sum-squared-residuals? Certainly, one could use other definitions (see least-sum-magnitudes below). 
However, least squares residuals are most common because they have many useful properties: 


• Squared residuals are reasonable: they’re always positive. 


• Squared residuals are continuous and differentiable functions of things like fit parameters 
  (magnitude residual is not differentiable). Differentiable means we can analytically minimize 
  it, and for linear fits, the resulting equations are linear. 


• The sum-of-squares identity is only valid for least-squares fits; this identity allows us to cleanly 
  separate our data variation into a “model” and “residuals.” 


• We can compute many other analytic results from least squares, which is not generally true with 
  other residual measures. 


• Variance is defined as average of squared deviation (aka “residual”), and variances of 
  uncorrelated random values simply add. 


• The central limit theorem causes gaussian distributions to appear frequently in the natural 
  world, and one of its two natural parameters is variance: an average squared-residual. 


• For gaussian residuals, least squares parameter estimates are also maximum likelihood (in the 
  misleading statistical sense??), and minimum variance [Meyers 1996 p??]. 


Most other measures of residuals have fewer nice properties. Note that the residuals include both unmodeled 
behavior and (unmodelable) noise, so even if your noise is gaussian, your residuals are not. 


Why Not Least-Sum-Magnitudes? 


A common question is “Why not magnitude of residuals, instead of squared residuals?” Least-sum- 
magnitude residuals have at least two serious problems. First, they often yield clearly bad results; and second, 
least-sum-magnitude-residuals can be highly degenerate: there are often an infinite number of solutions that 
are “equally” good, and that’s bad. 


To illustrate, Figure 7.4a shows the least sum magnitude “average” for 3 points. Sliding the average line 
up or down increases the magnitude difference for points 1 and 2, and decreases the magnitude difference by 
the same amount for point 3. Points 1 and 2 totally dominate the result, regardless of how large point 3 is. 
This is intuitively undesirable for most purposes. 
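

The insensitivity of the least-sum-magnitude “average” to the far point can be seen with a brute-force search. This is a sketch with made-up data (points 1, 2, and a third point that is moved farther away) and an illustrative helper `best_constant`; the search grid of 0.01 steps from 0 to 400 is an arbitrary choice.

```python
from statistics import mean, median

GRID = [i / 100 for i in range(40001)]  # candidate "averages" from 0 to 400 in steps of 0.01

def best_constant(points, cost):
    """Brute-force the constant a on GRID that minimizes the summed cost of residuals."""
    return min(GRID, key=lambda a: sum(cost(p - a) for p in points))

for third_point in (3.0, 30.0, 300.0):
    points = [1.0, 2.0, third_point]
    least_magnitude = best_constant(points, abs)               # stays at the median
    least_squares = best_constant(points, lambda r: r * r)     # follows the mean
    print(third_point, least_magnitude, median(points), least_squares, mean(points))
```

The least-sum-magnitude answer coincides with the median and never moves from the middle point, while the least-squares answer coincides with the mean and follows the third point.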


Figure 7.4b and (c) show the degeneracy: both lines have equal sum magnitudes, but intuitively, fit (b) 
is vastly better for most purposes. 
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Figure 7.4 (a) least-sum-magnitude “average”. (b) Example fit to least-sum-magnitude-residuals. 
The sum-magnitude is unchanged by moving the “fit line” straight up or down. (c) Alternative “fit” 
has same sum-magnitude-residuals, but is a much less-likely fit for realistic residual distributions. 




Other Misfit Measures 


There are some cases where least squares residuals do not work well, in particular, if you have outliers 
in your data. When you square the residual to an outlier, you get a really big number. This squared-residual 
swamps out all your (real) residuals, thus wreaking havoc with your results. The usual practice is to identify 
the outliers, remove them, and analyze the remaining data with least-squares. However, on rare occasions, 
one might work with a residual measure other than least squared residuals [Myers ??]. 


When working with data where each measurement has its own uncertainty, we usually replace the least 
squared residuals criterion with least-chi-squared. We discuss this later when considering data with 
individual uncertainties. 


Average, Variance, and Standard Deviation 


In statistics, an efficient estimator = the most efficient estimator [ref??]. There is none better (i.e., none 
with smaller variance). You can prove mathematically that for gaussian populations, the average and 
variance of a sample are the most efficient estimators (least variance) of the population average and variance. 
It is impossible to do any better, so it’s not worth looking for better ways. For gaussian populations, the most 
efficient estimators are least squares estimators, which means that over many samples, they minimize the 
sum-squared error from the true value. We discuss least-squares vs. maximum-likelihood estimators later. 


Note, however, that given a set of measurements, some of them may not actually measure the population 
of interest (i.e., they may be noise). If you can identify those bad measurements from a sample, you should 
remove them before estimating any parameter. Usually, in real experiments, there is always some 
unremovable corruption of the desired signal, and this contributes to the uncertainty in the measurement. 


Recall that a sample is a set of n random values taken from a population X. The sample average is 
defined as: 


$$\langle x \rangle \equiv \frac{1}{n}\sum_{i=1}^{n} x_i ,$$ 


and is the least variance estimate of the average <X> of any population [ref??]. It is unbiased [ref??], which 
means the average of many sample estimates approaches the true population average: 


$$\big\langle \langle x \rangle \big\rangle_{\text{many samples}} = \langle X \rangle \quad\text{where}\quad \langle\;\cdot\;\rangle_{\text{over what}} \equiv \text{average, over the given parameter if not obvious} .$$ 


Note that the definition of unbiased is not that the estimator approaches the true value for large samples; it is 
that the average of the estimator approaches the true value over many samples, even small samples. See 
sample standard deviation, just below. 


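The sample variance and standard deviation are defined as: 


$$s^2 \equiv \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \langle x \rangle\right)^2, \qquad s \equiv \sqrt{s^2}, \qquad\text{where } \langle x \rangle \text{ is the sample average, as above: } \langle x \rangle \equiv \langle x_i \rangle .$$ 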


The sample variance is an efficient and unbiased estimate of var(X), which means no other estimate of var(X) 
is better. Note that s² is unbiased, but s is biased, because the square root of the average is not equal to the 
average of the square root: 


$$\langle s \rangle_{\text{many samples}} \ne \mathrm{dev}(X) \quad\text{because}\quad \left\langle \sqrt{s^2} \right\rangle \ne \sqrt{\left\langle s^2 \right\rangle} .$$ 


This exemplifies the importance of properly defining “bias”: 


$$\langle s \rangle_{\text{many samples}} \ne \mathrm{dev}(X) \quad\text{even though}\quad \lim_{n \to \infty} s = \mathrm{dev}(X) .$$ 


Sometimes you see variance defined with 1/n, and sometimes with 1/(n — 1). Why? The population 
variance is defined as the mean-squared deviation from the population average. For a finite population (such 
as test scores in a given class), we find the population variance using 1/N, where N is the number of values 
in the whole population: 


$$ \mathrm{var}(X) \equiv \frac{1}{N} \sum_{i=1}^{N} \left( X_i - \mu \right)^2 $$

where N is the number of values in the entire population, X_i is the i-th value of the population, and
μ ≡ exact population average.


In contrast, the sample variance is the variance of a sample taken from a population. The population
average μ is usually unknown. We can only estimate μ ≈ <x>. Then to make s² an unbiased estimate of
var(X), we must use 1/(n − 1) (as we show later), where n is the sample size (not population size).


The sample variance is actually a special case of curve fitting, where we fit a constant, <x>, to the
population. This is a single parameter, and so removes 1 degree of freedom from our fit errors. Hence, the
mean-squared fit error (i.e., s²) has 1 degree of freedom less than the sample size. (Much more on curve
fitting later.)


For a sample from a population when the average μ is exactly known, we use n as the weighting for s²
to be an unbiased estimator of var(X):

$$ s^2 \equiv \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \mu \right)^2 , \qquad \text{which is just the above equation with } X_i \to x_i , \; N \to n . $$


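In contrast, infinite populations with unknown μ can only have samples, and thus always use n − 1. But
as n → ∞, it doesn’t matter, so we can compute the population variance either way:

$$ \mathrm{var}(X) = \lim_{n \to \infty} \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2 = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2 , \qquad \text{because } n - 1 \to n \text{ when } n \to \infty . $$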


Central Limit Theorem For Continuous And Discrete Populations 


The central limit theorem is important because it allows us to estimate some properties of a population
given only a sample of the population, with no a priori information. Given a population, we can take a sample
of it, and compute its average. If we take many samples, each will (likely) produce a different average.
Hence, the average of a sample is a new random variable, created from the original.


The central limit theorem says that for any population, as the sample size grows,


the sample average approaches a gaussian random variable, with average equal to the population 
average, and variance equal to the population variance divided by n. 


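Mathematically, given a random variable X, with mean μ and variance σ_X² = var(X):

$$ \lim_{n \to \infty} \langle x_i \rangle \in \text{gaussian}\!\left( \mu , \frac{\sigma_X^2}{n} \right) \qquad \text{where } \langle x_i \rangle \equiv \text{sample average} . $$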

Note that the central limit theorem applies only to multiple samples from a single population (though
there are some variations of the theorem that can be applied to multiple populations). [It is possible to
construct large sums of multiple populations whose averages do not approach gaussian, e.g. in
communication theory, inter-symbol interference (ISI). But we will not go further into that.]


How does the Central Limit Theorem apply to a discrete population? If a population is discrete, 
then any sample average is also discrete. But the gaussian distribution is continuous. So how can the sample 
average approach a gaussian for large sample size N? Though the sample average is discrete, the density of 
allowed values increases with N. If you simply plot the discrete values as points, those points approach the 
gaussian curve. For very large N, the points are so close, they “look” continuous. 
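To see the discrete points closing in on the gaussian curve, here is a minimal sketch for one assumed population: a fair coin scored 0 or 1 (our choice, not from the text). The sample average of n tosses can only take the values k/n, spaced 1/n apart, so we compare Pr(k/n) divided by that spacing with the gaussian density that the CLT predicts. The function name `max_scaled_error` is ours.

```python
from math import comb, exp, pi, sqrt

def max_scaled_error(n, p=0.5):
    """Largest gap between the discrete distribution of the sample average of
    n Bernoulli(p) values and the CLT gaussian, as a fraction of the gaussian peak."""
    mu, var = p, p * (1 - p) / n          # CLT: same mean, variance var(X)/n
    peak = 1 / sqrt(2 * pi * var)         # gaussian density at its maximum
    worst = 0.0
    for k in range(n + 1):
        prob = comb(n, k) * p**k * (1 - p)**(n - k)   # Pr(sample average = k/n)
        density = prob * n                            # allowed values are 1/n apart
        gauss = peak * exp(-(k / n - mu) ** 2 / (2 * var))
        worst = max(worst, abs(density - gauss))
    return worst / peak

for n in (10, 100, 1000):
    print(n, max_scaled_error(n))
```

The printed mismatch shrinks as n grows, which is the sense in which the discrete points approach the continuous curve.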


TBS: Why binomial (discrete), Poisson (discrete), and chi-squared (continuous) distributions approach
gaussian for large n (or ν).


Uncertainty of Average 


A sample average x̄ gives us an estimate of the population average μ. The sample average, when taken
as a set of values of many samples, is itself a random variable. The Central Limit Theorem (CLT) says that
if we know the population standard deviation σ, the sample average will have standard deviation:

$$ \mathrm{dev}(\bar{x}) = \frac{\sigma}{\sqrt{n}} \qquad \text{(proof below)} . $$

In statistics, dev(x̄) is called the standard error of the mean. In experiments, dev(x̄) is the 1-sigma
uncertainty in our estimate of the population average μ. However, most often, we know neither μ nor σ, and
must estimate both from our sample, using x̄ and s. For “large” samples, we use simply σ ≈ s, and then:

$$ \mathrm{dev}(\bar{x}) \approx \frac{s}{\sqrt{n}} \qquad \text{for “large” samples, i.e. } n \text{ is “large”} . $$
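As a small sketch of the arithmetic, the following computes x̄, s, and s/√n from a list of measurements. The data values and the function name `standard_error` are hypothetical, and the sample is deliberately tiny so the numbers are easy to follow; the estimate itself is only trustworthy when n is "large."

```python
from math import sqrt
from statistics import mean, stdev

def standard_error(sample):
    """Return (sample average, s, s/sqrt(n)) for a list of measurements."""
    n = len(sample)
    xbar = mean(sample)
    s = stdev(sample)            # statistics.stdev uses the 1/(n-1) weighting
    return xbar, s, s / sqrt(n)

# Hypothetical measurements, for illustration only.
data = [9.8, 10.1, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0]
xbar, s, sem = standard_error(data)
print(f"average = {xbar:.3f}, s = {s:.3f}, dev(average) ~ {sem:.3f}")
```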


For small samples, we must still use s as our estimate of the population deviation, since we have nothing
else. But instead of assuming that x̄ is gaussian, we use the exact distribution, which is a little wider,
called a t-distribution [W&M ??]. The t-distribution is a 1-parameter family, with the parameter “degrees
of freedom” = n − 1. pdf(t) is complicated to write explicitly. The argument t is similar to the gaussian
z ≡ (x − μ)/σ; both are measures of dimensionless distance from the mean:

$$ t \equiv \frac{\bar{x} - \mu}{s / \sqrt{n}} \qquad \text{where } \bar{x} \equiv \text{sample average} , \; s \equiv \text{sample standard deviation} . $$

We use t, and t-tables, to establish confidence intervals [ref??].


Uncertainty of Uncertainty: How Big Is Infinity? 


Sometimes, we need to know the uncertainty in our estimate of the population variance (or standard
deviation), i.e., we need dev(s). We start by looking more closely at the uncertainty in our estimate s² of the
population variance σ². The random variable (n − 1)s²/σ² has chi-squared distribution with n − 1 degrees of
freedom [W&M Thm 6.16 p201], or equivalently:

$$ s^2 \in \frac{\sigma^2}{n-1} \, \chi^2_{n-1} \quad \Rightarrow \quad \mathrm{var}(s^2) = \left( \frac{\sigma^2}{n-1} \right)^{2} 2(n-1) = \frac{2 \sigma^4}{n-1} . $$


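However, usually we’re more interested in the uncertainty of the standard deviation estimate s, rather than
s². For that, we must approximate, using the fact that s is a function of s²: s ≡ (s²)^(1/2). For moderate or bigger
sample sizes, and confidence ranges up to 95% or so, we can use the approximate formula for the deviation
of a function of a random variable (see “Functions of Random Variables,” elsewhere):

$$ Y = f(X) \quad \Rightarrow \quad \mathrm{dev}(Y) \approx f'\!\left( \langle X \rangle \right) \mathrm{dev}(X) \qquad \text{for small } \mathrm{dev}(X) . $$

$$ s = \left( s^2 \right)^{1/2} \quad \Rightarrow \quad \mathrm{dev}(s) \approx \frac{1}{2} \left( \sigma^2 \right)^{-1/2} \mathrm{dev}(s^2) = \frac{1}{2\sigma} \, \frac{\sqrt{2} \, \sigma^2}{\sqrt{n-1}} = \frac{\sigma}{\sqrt{2(n-1)}} \approx \frac{s}{\sqrt{2(n-1)}} . $$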
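The first-order result dev(s) ≈ σ/√(2(n − 1)) can be checked by brute force. The sketch below is a Monte Carlo illustration under assumed values (a gaussian population with σ = 2, n = 30, and 4000 simulated samples; none of these numbers come from the text). It measures the spread of s over many samples and compares it with the formula.

```python
import random
from math import sqrt
from statistics import stdev

random.seed(1)                      # fixed seed so the run is repeatable
sigma, n, trials = 2.0, 30, 4000    # illustrative values, not from the text

def sample_dev(xs):
    """Sample standard deviation s, with the 1/(n-1) weighting."""
    xbar = sum(xs) / len(xs)
    return sqrt(sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1))

# One value of s per simulated sample of size n from a gaussian population.
s_values = [sample_dev([random.gauss(0.0, sigma) for _ in range(n)])
            for _ in range(trials)]

measured = stdev(s_values)                 # spread of s over many samples
predicted = sigma / sqrt(2 * (n - 1))      # first-order formula
print(f"measured dev(s) = {measured:.4f}, predicted = {predicted:.4f}")
```

The two numbers should agree to within a few percent; the residual difference is a mix of simulation noise and the first-order approximation.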


This rule is most often used in estimating the standard error of the mean, dev(x̄) (see above), given by

$$ \mathrm{dev}(\bar{x}) = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}} . $$

For small samples, this approximation isn’t so good. Then, as noted above, the
uncertainty dev(x̄) needs to include both the true sampling uncertainty in x̄ and the uncertainty in s. To be
confident that our x̄ is within our claim, we need to expand our confidence limits, to allow for the chance
that s happens to be low. The Student T-distribution exactly handles this correction to our confidence limits
on x̄ for all sample sizes.


However, when can we ignore this correction? In other words, how big should n be for the gaussian (as
opposed to T) distribution to be a good approximation? The uncertainty in s is:

$$ \mathrm{dev}(s) \approx \frac{\sigma}{\sqrt{2(n-1)}} . $$


This might seem circular, because we still have σ (which we don’t know) on the right hand side. However,
its effect is now reduced by the fraction multiplying it. So the uncertainty in σ is also reduced by this factor,
and we can neglect it. Thus to first order, we have:

$$ \mathrm{dev}(s) \approx \frac{1}{\sqrt{2(n-1)}} \, \sigma \approx \frac{1}{\sqrt{2(n-1)}} \, s . $$

So long as this dev(s) << s, we can ignore it. In other words:

$$ \mathrm{dev}(s) \ll s \quad \Rightarrow \quad \frac{1}{\sqrt{2(n-1)}} \ll 1 , \qquad \text{for } \bar{x} \text{ to be approximately gaussian, and } s \approx \sigma . $$


(You may notice that dev(s) is correlated with s: bigger s implies bigger (estimated) dev(s), so the
contribution to dev(x̄) from dev(s) does not add in quadrature to s/√n.) When n = 30:

$$ \frac{1}{\sqrt{2(30-1)}} = 0.13 \ll 1 . $$
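To get a feel for how slowly this fraction falls with sample size, a short sketch can tabulate 1/√(2(n − 1)) for a few values of n. The function name `relative_dev_s` and the particular list of n values are ours.

```python
from math import sqrt

def relative_dev_s(n):
    """First-order fractional uncertainty of s: dev(s)/s ~ 1/sqrt(2(n-1))."""
    return 1 / sqrt(2 * (n - 1))

for n in (5, 10, 30, 100, 1000):
    print(f"n = {n:5d}   dev(s)/s ~ {relative_dev_s(n):.3f}")
```

Because the fraction falls only as 1/√n, cutting the uncertainty of the uncertainty in half takes roughly four times as many measurements.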


13% is pretty reasonable for the uncertainty of the uncertainty, i.e. dev(dev(x̄)), and n = 30 is the generally
agreed upon bound for good confidence that s ≈ σ [ref??].


Functions of Random Variables


It follows from the definition of probability that the average value of any function of a random variable
is:

$$ \langle f(X) \rangle = \int_{-\infty}^{\infty} dx \; f(x) \, \mathrm{pdf}_X(x) . $$

We can apply this to our definitions of population average and population variance:

$$ \bar{X} \equiv \langle X \rangle = \int_{-\infty}^{\infty} dx \; x \, \mathrm{pdf}_X(x) , \qquad \text{and} \qquad \mathrm{var}(X) = \int_{-\infty}^{\infty} dx \; (x - \bar{X})^2 \, \mathrm{pdf}_X(x) . $$
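These integrals are easy to evaluate numerically when the pdf is known. The sketch below is a minimal illustration that assumes a population uniform on [0, 1] (our choice) and uses a simple midpoint rule; the helper name `average` is ours.

```python
def average(f, pdf, a, b, steps=100_000):
    """Midpoint-rule estimate of <f(X)> = integral of f(x) pdf(x) dx over [a, b]."""
    dx = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * dx        # midpoint of the i-th slice
        total += f(x) * pdf(x)
    return total * dx

# Hypothetical population: X uniform on [0, 1], so pdf(x) = 1 on that interval.
pdf = lambda x: 1.0
mean_x = average(lambda x: x, pdf, 0.0, 1.0)
var_x = average(lambda x: (x - mean_x) ** 2, pdf, 0.0, 1.0)
print(mean_x, var_x)   # close to 1/2 and 1/12
```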


Statistically Speaking: What Is The Significance of This? 


Before we compute any uncertainties, we should understand what they mean. Statistical significance 
interprets uncertainties. It is one of the most misunderstood, and yet most important, concepts in science. It 
underlies virtually all experimental and simulation results. Beliefs (correct and incorrect) about statistical 
significance drive experiment, research, funding, and policy. 


Understanding statistical significance is a prerequisite to understanding science. 


This cannot be overstated, and yet many (if not most) scientists and engineers receive no formal training 
in statistics. The following few pages describe statistical significance, surprisingly using almost no math. 


Overview of Statistical Significance 


The term “statistically significant” has a precise meaning which is, unfortunately, 


different than the common meaning of the word “significant.” 


Many experiments compare quantitative measures of two populations, e.g. the IQs of ferrets vs. gophers. 
In any real experiment, the two measures will almost certainly differ. How should we interpret this 
difference? 


We can use statistics to tell us the meaning of the difference. A difference which is not “statistically 
significant” in some particular experiment may, in fact, be quite important. But we can only determine its 
importance if we do another experiment with finer resolution, enough to satisfy our subjective judgment of 
“importance.” For this section, I use the word importance to mean a subjective assessment of a measured 
result. 


The statement “We could not measure a difference” is very different from “There is no important 
difference.” Statistical significance is a quantitative comparison of the magnitude of an effect and the 
resolution of the statistics used to measure it. 


This section requires an understanding of probability and uncertainty. 


Statistical significance can be tricky, so we start with several high level statements about what statistical 
significance is, and is not. We then give more specific statements and examples. 


Statistical significance is many things: 


Statistical significance is a measure of an experiment’s ability to resolve its own measured result. 


It is not a measure of the importance of a result. 


“Statistically significant” means “measurable by this experiment.” “Not statistically significant” means 
that we cannot fully trust the result from this experiment alone; the experiment was too crude to have 
confidence in its own result. 


Statistical significance is closely related to uncertainty. 


Statistical significance is a quantitative statement of the probability that a result is real, instead of a 
measurement error or the random result of sampling that just happened to turn out that way (by chance). 


4/27/2021 11:49AM — Copyright 2002-2021 Eric L. Michelsen. All rights reserved. 111 of 322 


elmichelsen.physics.ucsd.edu/ | Funky Mathematical Physics Concepts emichels at physics.ucsd.edu 


Statistical significance is a one-way street: if a result is statistically significant, it is (probably) real. 
However, it may or may not be important. In contrast, if a result is not statistically significant, then we don’t 
know if it’s real or not. However, we will see that even a not significant result can sometimes provide 
meaningful and useful information. 


Details of Statistical Significance 


A meaningful measurement must contain two parts: the magnitude of the result, and the confidence limits
on it, both of which are quantitative statements. When we say, “the average IQ of ferrets in our experiment
is 102 ± 5 points,” we mean that there is a 95% chance that the actual average IQ is between 97 and 107.
We could also say that our 95% confidence limits are 97 to 107. Or, we could say that our 95% uncertainty
is 5 points. The confidence limits are sometimes called error bars, because on a graph, confidence limits
are conventionally drawn as little bars above and below the measured values.


Suppose we test gophers and find that their average IQ is 107 ± 4 points. Can we say “on average,
gophers have higher IQs than ferrets?” In other words, is the difference we measured significant, or did it
happen just by chance? To assess this, we compute the difference, and its uncertainty (recall that uncorrelated
uncertainties add in quadrature):

$$ \Delta IQ = (107 - 102) \pm \sqrt{4^2 + 5^2} = 5 \pm 6 \qquad \text{(gophers − ferrets)} $$


This says that the difference lies within our uncertainty, so we are not 95% confident that gophers have
higher IQs. Therefore, we still don’t know if either population has higher IQs than the other. Our experiment
was not precise enough to measure a difference. This does not mean that there is no difference. However,
we can say that there is a 95% chance that the difference is between −1 and 11 (5 ± 6). A given experiment
measuring a difference can produce one of two results of statistical significance: (1) the difference is
statistically significant; or (2) it is not. In this case, the difference is not (statistically) significant at the 95%
level.
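The quadrature sum and the significance test are simple enough to sketch in a few lines. The function names `difference` and `is_significant` are ours, and both uncertainties are assumed to be quoted at the same confidence level (95% here).

```python
from math import hypot

def difference(a, da, b, db):
    """Return (a - b, uncertainty), adding uncorrelated uncertainties in quadrature."""
    return a - b, hypot(da, db)

def is_significant(diff, unc):
    """True if the confidence interval diff +/- unc excludes zero."""
    return abs(diff) > unc

diff, unc = difference(107, 4, 102, 5)          # gophers - ferrets, 95% limits
print(f"difference = {diff} +/- {unc:.1f}")     # 5 +/- 6.4
print("significant at this level:", is_significant(diff, unc))   # False
```

The unrounded uncertainty is about 6.4 points; the text rounds it to 6, which does not change the conclusion.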


In addition, confidence limits yield one of three results of “importance”: they (1) confirm that a difference is
important, (2) confirm that it is not important, or (3) are inconclusive. But the judgment of how much is “important” is
outside the scope of the experiment. For example, we may know from prior research that a 10 point average 
IQ difference makes a population a better source for training pilots, enough better to be “important.” Note 
that this is a subjective statement, and its precise meaning is outside our scope here. 


Five of the six combinations of significance and importance are possible, as shown by the following 
examples. 


Example 1, not significant, and inconclusive importance: With the given numbers, ΔIQ = 5 ± 6, the
“importance” of our result is inconclusive, because we don’t know if the average IQ difference is more or
less than 10 points.


Example 2, not significant, but definitely not important: Suppose that prior research showed 
(somehow) that a difference needed to be 20 points to be “important.” Then our experiment shows that the 
difference is not important, because the difference is very unlikely to be as large as 20 points. In this case, 
even though the results are not statistically significant, they are very valuable; they tell us something 
meaningful and worthwhile, namely, the difference between the average IQs of ferrets and gophers is not 
important for using them as a source for pilots. The experimental result is valuable, even though not 
significant, because it establishes an upper bound on the difference. 


Example 3, significant, but inconclusive importance: Suppose again that a difference of 10 points is
important, but our measurements are: ferrets average 100 ± 3 points, and gophers average 107 ± 2 points.
Then the difference is:

$$ \Delta IQ = (107 - 100) \pm \sqrt{2^2 + 3^2} = 7 \pm 4 \qquad \text{(gophers − ferrets)} $$




These results are statistically significant: there is better than a 95% chance that the average IQs of ferrets 
and gophers are different. However, the importance of the result is still inconclusive, because we don’t know 
if the difference is more or less than 10 points. 


Example 4, significant and important: Suppose again that a difference of 10 points is important, but
we measure that ferrets average 102 ± 3 points, and gophers average 117 ± 2 points. Then the difference is:

$$ \Delta IQ = (117 - 102) \pm \sqrt{2^2 + 3^2} = 15 \pm 4 \qquad \text{(gophers − ferrets)} $$


Now the difference is both statistically significant, and important, because there is a 95% chance that the 
difference is > 10 points. We are better off choosing gophers to go to pilot school. 


Example 5, significant, but not important: Suppose our measurements resulted in

$$ \Delta IQ = 5 \pm 4 $$


Then the difference is significant, but not important, because we are confident that the difference < 10. 
This result established an upper bound on the difference. In other words, our experiment was precise enough 
that if the difference were important (i.e., big enough to matter), then we’d have measured it. 


Finally, note that we cannot have a result that is not significant, but important. Suppose our result was:

$$ \Delta IQ = 11 \pm 12 $$


The difference is unmeasurably small, and possibly zero, so we certainly cannot say the difference is 
important. In particular, we can’t say the difference is greater than anything. 
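The five examples and the impossible sixth case can be summarized by a small classifier. This is only our reading of the examples, sketched under stated assumptions: the difference and the importance threshold are both taken as positive, "important" means the whole confidence interval lies at or above the threshold, and "not important" means the whole interval lies below it. The function name `classify` is ours.

```python
def classify(diff, unc, important):
    """Classify a measured difference diff +/- unc against a subjective
    importance threshold `important` (assumed > 0)."""
    low, high = diff - unc, diff + unc
    significant = low > 0 or high < 0       # the interval excludes zero
    if low >= important:
        importance = "important"            # implies low > 0, hence significant
    elif high < important:
        importance = "not important"
    else:
        importance = "inconclusive"
    return significant, importance

examples = {
    "Example 1": (5, 6, 10),
    "Example 2": (5, 6, 20),
    "Example 3": (7, 4, 10),
    "Example 4": (15, 4, 10),
    "Example 5": (5, 4, 10),
    "11 +/- 12": (11, 12, 10),
}
for name, (diff, unc, important) in examples.items():
    print(name, classify(diff, unc, important))
```

With these rules, "important" requires the lower limit to exceed a positive threshold, which forces the interval to exclude zero. That is why "not significant, but important" never appears.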


Thus we see that stating “there is a statistically significant difference” is (by itself) not saying much, 
because the difference could be tiny, and physically unimportant. 


We have used here the common confidence limit fraction of 95%, often taken to be ~2σ. The next most
common fraction is 68%, or ~1σ. Another common fraction is 99%, taken to be ~3σ. More precise gaussian
fractions are 95.45% and 99.73%, but the digits after the decimal point are usually meaningless (i.e., not
statistically significant!). Note that we cannot round 99.73% to the nearest integer, because that would be
100%, which is meaningless in this context. Because of the different confidence fractions in use, you should
always state your fractions explicitly. You can state your confidence fraction once, at the beginning, or along
with your uncertainty, e.g. 10 ± 2 (1σ).
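The gaussian fractions quoted above follow from the error function: the probability of landing within k standard deviations of the mean is erf(k/√2). A minimal sketch, with the function name `gaussian_fraction` being ours:

```python
from math import erf, sqrt

def gaussian_fraction(k):
    """Probability that a gaussian value lies within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"{k} sigma: {100 * gaussian_fraction(k):.2f}%")
```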


Caveat: We are assuming random errors, which are defined as those that average out with larger 
sample sizes. Systematic errors do not average out, and result from biases in our measurements. For 
example, suppose the IQ test was prepared mostly by gophers, using gopher cultural symbols and metaphors 
unfamiliar to most ferrets. Then gophers of equal intelligence will score higher IQs because the test is not 
fair. This bias changes the meaning of all our results, possibly drastically. 


Ideally, when stating a difference, one should put a lower bound on it that is physically important, and 
give the probability (confidence) that the difference is important. E.g. “We are 95% confident the difference 
is at least 10 points” (assuming that 10 points on this scale matters). 


Examples
Here are some examples of meaningful and not-so-meaningful statements:

Meaningless Statements                        | Meaningful Statements, possibly subjective
(appearing frequently in print)               | (not appearing enough in print)
----------------------------------------------+-------------------------------------------------
The difference in IQ between groups A and B   | Our data show there is a 99% likelihood that
is not statistically significant.             | the IQ difference between groups A and B is
(Because your experiment was bad, or because  | less than 1 point.
the difference is small?)                     |
----------------------------------------------+-------------------------------------------------
We measured an average IQ difference of 5     | Our experiment had insufficient resolution to
points. (With what confidence?)               | tell if there was an important difference in IQ.
----------------------------------------------+-------------------------------------------------
Group A has a statistically significantly     | Our data show there is a 95% likelihood that
higher IQ than group B.                       | the IQ difference between groups A and B is
(How much higher? Is it important?)           | greater than 10 points.


Statistical significance summary: “Statistical significance” is a quantitative statement about an 
experiment’s ability to resolve its own result. We use “importance” as a subjective assessment of a 
measurement that may be guided by other experiments, and/or gut feel. Statistical significance says nothing 
about whether the measured result is important or not. 


Predictive Power: Another Way to Be Significant, but Not Important 


Suppose that we have measured IQs of millions of ferrets and gophers over decades. Suppose their
population IQs are gaussian, and given by (note the use of 1σ uncertainties):


ferrets: 101 ± 20        gophers: 103 ± 20        (1σ).


The average difference is small, but because we have millions of measurements, the uncertainty in the 
average is even smaller, and we have a statistically significant difference between the two groups. 


Suppose we have only one slot open in pilot school, but two applicants: a ferret and a gopher. Who 
should get the slot? We haven’t measured these two individuals, but we might say, “Gophers have 
‘significantly’ higher IQs than ferrets, so we'll accept the gopher.” Is this valid? 


To quantitatively assess the validity of this reasoning, let us suppose (simplistically) that pilot students
with an IQ of 95 or better are 20% more likely (1.2×) to succeed than those with IQ < 95. From the given
statistics, 61.8% of ferrets have IQs > 95, vs. 65.5% of gophers. That is, 61.8% of ferrets get the 1.2× boost
in likelihood of success, and similarly for the gophers. Then the relative probabilities of success are:


ferrets: 0.382 + 0.618(1.2) = 1.12        gophers: 0.345 + 0.655(1.2) = 1.13.
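These fractions and relative likelihoods can be reproduced from the gaussian cumulative distribution. The sketch below uses Python's `statistics.NormalDist`; the function name `relative_success` and its parameter names are ours, while the cutoff of 95 and the 1.2× boost are the simplistic assumptions stated above.

```python
from statistics import NormalDist

def relative_success(mean_iq, dev_iq, cutoff=95, boost=1.2):
    """Relative likelihood of success for a random member of a gaussian population,
    where members above `cutoff` are `boost` times as likely to succeed."""
    above = 1 - NormalDist(mean_iq, dev_iq).cdf(cutoff)   # fraction with IQ > cutoff
    return (1 - above) + above * boost, above

ferret, f_above = relative_success(101, 20)
gopher, g_above = relative_success(103, 20)
print(f"ferrets: {100 * f_above:.1f}% above cutoff, relative success {ferret:.3f}")
print(f"gophers: {100 * g_above:.1f}% above cutoff, relative success {gopher:.3f}")
print(f"gopher/ferret ratio: {gopher / ferret:.4f}")
```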


Thus a random gopher is 113/112 times (less than 0.7% more) likely to succeed than a random ferret. This 
is pretty unimportant. In other words, species (between ferrets and gophers) is not a good predictor of 
success. Species is so bad that many, many other facts will be better predictors of success. Height, eyesight, 
years of schooling, and sports ability are probably all better predictors. The key point is this: 


Unbiased vs. Maximum-Likelihood Estimators 


In experiments, we frequently have to estimate parameters from data. There is a very important 
difference between “unbiased” and “maximum likelihood” estimates, even though sometimes they are the 
same. Sadly, two of the most popular experimental statistics books confuse these concepts, and their 
distinction. 


[A common error is to try to “derive” unbiased estimates using the principle of “maximum likelihood,” which 
is impossible since the two concepts are very different. One popular text goes through the exercise of “deriving” 
the formula for unbiased sample variance from the principle of maximum likelihood, and necessarily gets the wrong 
answer! Hand waving is then applied to wiggle out of the mistake. ] 


Everything in this section applies to arbitrary distributions, not just gaussian. We follow these steps: 
1. Terse definitions, which won’t be entirely clear at first. 

2. Example of estimating the variance of a population (things still fuzzy). 

3. Examples of the need for maximum-likelihood in both single and repeated trials. 

4. Real-world physics examples of different situations leading to different choices between unbiased
and maximum-likelihood.

5. Closing comments.


Terse definitions: In short:


For example, the average of many samples of a population average is likely closer to the right answer than a 
single sample of the population average is. 


A maximum likelihood statistic is one which is most likely to have produced the given data. Note
that if it is biased, then the average of many maximum likelihood estimates does not get you closer to the
right answer. In other words, given a fixed set of data, maximum-likelihood estimates have some merit, but
biased ones can’t be combined well with other sets of data (perhaps future data, not yet taken). This concept
should become more clear below.


Which is better, an unbiased estimate or a maximum-likelihood estimate? It depends on what your goals
are.


Example of population variance: Given a sample of values {y_i} from a population, an unbiased
estimate of the population variance is

$$ \sigma^2 \approx \frac{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}{n-1} \qquad \text{(unbiased estimate)} . $$


If we take several samples of the population, compute an unbiased estimate of the variance for each sample, 
and average those estimates, we’ll get a better estimate of the population variance. Usually, unbiased 
estimators are those that minimize the sum-squared-error from the true value (principle of least-squares). 


However, what if we only get one shot at estimating the population variance? 


Examples: Maximum likelihood vs. unbiased: Suppose Monty Hall says, “I’ll give you a zillion dollars
if you can estimate the variance (to within some tolerance).” What estimate should we give him? Since we
only get one chance, we don’t care about the average of many estimates being accurate. We want to give
Mr. Hall the variance estimate that is most likely to be right. One can show that the most likely estimate is
given by using n in the denominator, instead of (n − 1):


$$ \sigma^2 \approx \frac{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}{n} \qquad \text{(maximum-likelihood estimate)} . $$


This is the estimate most likely to win the prize. Perhaps more physically, if you need to choose how long 
to fire a retro-rocket to land a spacecraft on the moon, do you choose (a) the burn time that, averaged over 
many spacecraft, reaches the moon, or (b) the burn time that is most likely to land your one-and-only craft 
on the moon? 


In the case of variance, the maximum-likelihood estimate is smaller than the unbiased estimate by a 
factor of (n — 1)/n. If we were to make many maximum-likelihood estimates, each one would be small by 
the same factor. The average would then also be small by that factor. No amount of averaging would ever 
fix this error. Our average estimate of the population variance would not get better with more estimates. 
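A quick simulation makes the point concrete. The sketch below assumes a gaussian population with variance 4 and samples of size n = 5 (illustrative values, not from the text), and averages both kinds of estimate over many samples.

```python
import random

random.seed(2)                       # fixed seed so the run is repeatable
sigma2, n, trials = 4.0, 5, 20000    # illustrative values, not from the text

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample average."""
    ybar = sum(sample) / len(sample)
    return sum((y - ybar) ** 2 for y in sample)

unbiased_total = ml_total = 0.0
for _ in range(trials):
    sample = [random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    ss = sum_sq_dev(sample)
    unbiased_total += ss / (n - 1)   # unbiased estimate
    ml_total += ss / n               # maximum-likelihood estimate

print("true variance               :", sigma2)
print("average unbiased estimate   :", unbiased_total / trials)
print("average max-likelihood est. :", ml_total / trials)
print("(n - 1)/n times the variance:", sigma2 * (n - 1) / n)
```

The averaged unbiased estimates settle near the true variance, while the averaged maximum-likelihood estimates settle near (n − 1)/n times it, no matter how many samples we average.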


You might conclude that maximum-likelihood estimates are only good for situations where you get a 
single trial. However, we now show that maximum-likelihood estimates can be useful even when there are 
many trials of a statistical process: you are a medieval peasant barely keeping your family fed. Every 
morning, the benevolent king goes to the castle tower overlooking the public square, and tosses out a gold 
coin to the crowd. Whoever catches it, keeps it. 


Being better educated than most medieval peasants, each day you record how far the coin goes, and 
generate a PDF (probability distribution function) for the distance from the tower. It looks like Figure 7.5.


[Figure 7.5: Gold coin toss distance PDF. Vertical axis: PDF; horizontal axis: distance, with the most-likely distance and the average distance marked.]


The most-likely distance is notably different than the average distance. Given this information, where do 
you stand each day? Answer: At the most-likely distance, because that maximizes your payoff not only for 
one trial, but across many trials over a long time. The “best” estimator is in the eye of the beholder: as a 
peasant, you don’t care much for least squares, but you do care about most money. 


Note that the previous example of landing a spacecraft is the same as the gold coin question: even if you 
launch many spacecraft, for each one you would give the burn most-likely to land the craft. The average of 
many failed landings has no value. 


Real physics examples: Example 1: Suppose you need to generate a beam of ions, all moving at very 
close to the same speed. You generate your ions in a plasma, with a Maxwellian thermal speed distribution 
(roughly the same shape as the gold coin toss PDF). Then you send the ions through a velocity selector to 
pick out only those very close to a single speed. You can tune your velocity selector to pick any speed. Now 
ions are not cheap, so you want your velocity selector to get the most ions from the speed distribution that it 
can. That speed is the most-likely speed, not the average speed. So here again, we see that most-likely has 
a valid use even in repeated trials of random processes. 
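To put numbers on the gap between most-likely and average speed, here is a minimal numerical sketch. It assumes the standard Maxwell speed distribution of an ideal gas, with speed measured in units of √(kT/m); the grid spacing and upper cutoff are our choices.

```python
from math import exp, pi, sqrt

def maxwell_pdf(v):
    """Maxwell speed distribution, with speed v in units of sqrt(kT/m)."""
    return sqrt(2 / pi) * v * v * exp(-v * v / 2)

dv = 0.001
speeds = [(i + 0.5) * dv for i in range(10_000)]          # midpoints covering 0 to 10

most_likely = max(speeds, key=maxwell_pdf)                # speed at the peak of the pdf
average = sum(v * maxwell_pdf(v) for v in speeds) * dv    # <v> by the midpoint rule
print(f"most-likely speed = {most_likely:.3f}, average speed = {average:.3f}")
```

In these units the most-likely speed is √2 and the average is 2√(2/π), so the average is about 13% higher. A selector tuned to the average speed would sit off the peak and pass fewer ions.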


Example 2: Suppose you are tracing out the orbit of the moon around the earth by measuring the 
distance between the two. Any given day’s measurement has limited ability to trace out an entire orbit, so 
you must make many measurements over several years. You have to fit a model of the moon’s orbit to this 
large set of measurements. You’d like your fit to get better as you collect more data. Therefore, each day 
you choose to make unbiased estimates of the distance, so that on-average, over time, your estimate of the 
orbit gets better and better. If instead you chose each day’s maximum-likelihood estimator, you'd be off of 
the average (in the same direction) every day, and no amount of averaging would ever fix that. 


Wrap up: When you have a symmetric, unimodal distribution for a parameter estimate (symmetric 
around a single maximum), then the unbiased and maximum-likelihood estimates are identical. This is true, 
for example, for the average of a gaussian distribution. For asymmetric or multi-modal distributions, the 
unbiased and maximum-likelihood estimates are different, and have different properties. For gaussian 
populations, the least-squares estimators (which minimize the sum-squared error from the true value) are 
both unbiased and minimum variance (aka most “efficient”) [Meyers 1996 p??]. 


Correlation and Dependence 


To take a sample of a random variable X, we get a value of X_i for each sample point i, i = 1 ... n.
Sometimes when we take a sample, for each sample point we get not one, but two, random variables, X_i and
Y_i. The two random variables X_i and Y_i may or may not be related to each other. We define the joint
probability distribution function of X and Y such that:

$$ \Pr(x < X < x + dx \text{ and } y < Y < y + dy) = \mathrm{pdf}_{XY}(x, y) \, dx \, dy . $$


This is just a 2-dimensional version of a typical pdf. Since X and Y are random variables, we could look at
either of them and find its individual pdf: pdf_X(x), and pdf_Y(y). If X and Y have nothing to do with each other
(i.e., X and Y are independent), then a fundamental axiom of probability says that the probability density of
finding x < X < x+dx and y < Y < y+dy is the product of the two pdfs:

$$ X \text{ and } Y \text{ are independent} \quad \Rightarrow \quad \mathrm{pdf}_{XY}(x, y) = \mathrm{pdf}_X(x) \, \mathrm{pdf}_Y(y) . $$


The above equation is the definition of statistical independence:


Two random variables are independent if and only if
their joint distribution function is the product of the individual distribution functions.


A very different concept is “correlation.” Correlation is a measure of how linearly related two random
variables are. We discuss correlation in more detail later, but it turns out that we can define correlation
mathematically by the correlation coefficient. We start by defining the covariance:

$$ \mathrm{cov}(X, Y) \equiv \left\langle (X - \bar{X})(Y - \bar{Y}) \right\rangle , \qquad \rho(X, Y) \equiv \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} \equiv \text{correlation coefficient} . $$


The correlation coefficient ρ(X, Y) is proportional to the covariance cov(X, Y). If ρ (or the covariance) = 0,
then X and Y are uncorrelated. If ρ (or the covariance) ≠ 0, then X and Y are correlated. For a discrete
random variable:

$$ \mathrm{cov}(X, Y) = \frac{1}{N} \sum_{\text{population}} (x_i - \bar{X})(y_i - \bar{Y}) . $$


Note that ρ and covariance are symmetric in X and Y:

$$ \mathrm{cov}(X, Y) = \mathrm{cov}(Y, X) , \qquad \rho(X, Y) = \rho(Y, X) \qquad \text{(symmetric)} . $$


Two random variables are uncorrelated if and only if 
their covariance, defined above, is zero. 


Being independent is a stronger statement than uncorrelated. Random variables which are independent 
are necessarily uncorrelated (proof below). But variables which are uncorrelated can be highly dependent. 
For example, suppose we have a random variable X, which is uniformly distributed over [−1, 1]. Now define
a new random variable Y such that Y = X². Clearly, Y is dependent on X, but Y is uncorrelated with X. Y and
X are dependent because given either, we know a lot about the other. They are uncorrelated because for
every Y value, there is one positive and one negative value of X. So for every value of (X − X̄)(Y − Ȳ),
there is its negative, as well. The average is therefore 0; hence, cov(X, Y) = 0.


A crucial point is: 


Variances add for uncorrelated variables, even if they are dependent. 
This is easy to show. Given that X and Y are uncorrelated: 
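$$ \begin{aligned}
\mathrm{var}(X + Y) &= \left\langle \left[ X + Y - (\bar{X} + \bar{Y}) \right]^2 \right\rangle = \left\langle \left[ (X - \bar{X}) + (Y - \bar{Y}) \right]^2 \right\rangle \\
&= \left\langle (X - \bar{X})^2 + 2 (X - \bar{X})(Y - \bar{Y}) + (Y - \bar{Y})^2 \right\rangle \\
&= \left\langle (X - \bar{X})^2 \right\rangle + 2 \underbrace{\left\langle (X - \bar{X})(Y - \bar{Y}) \right\rangle}_{\mathrm{cov}(X, Y) \, = \, 0} + \left\langle (Y - \bar{Y})^2 \right\rangle \\
&= \mathrm{var}(X) + \mathrm{var}(Y) .
\end{aligned} $$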


All we needed to prove that variances add is that cov(X, Y) = 0. 
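Both claims, that Y = X² is uncorrelated with X and that the variances still add, can be checked with exact arithmetic. The sketch below stands in for the continuous uniform population with a hypothetical finite one: 201 equally likely values, symmetric about zero. The helper names `avg`, `cov`, and `var` are ours.

```python
from fractions import Fraction

# Hypothetical finite population standing in for X uniform on [-1, 1]:
# 201 equally likely values, symmetric about zero.
xs = [Fraction(k, 100) for k in range(-100, 101)]
ys = [x * x for x in xs]                      # Y = X^2, fully determined by X

def avg(values):
    return sum(values) / len(values)

def cov(us, vs):
    """Population covariance <(U - <U>)(V - <V>)>."""
    ubar, vbar = avg(us), avg(vs)
    return avg([(u - ubar) * (v - vbar) for u, v in zip(us, vs)])

def var(us):
    return cov(us, us)

sums = [x + y for x, y in zip(xs, ys)]
print("cov(X, Y)       =", cov(xs, ys))       # 0
print("var(X + Y)      =", var(sums))
print("var(X) + var(Y) =", var(xs) + var(ys))
```

The covariance is exactly zero, and the last two lines print the same fraction, even though Y is completely determined by X.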


Independent Random Variables are Uncorrelated 


It is extremely useful to know that independent random variables are necessarily uncorrelated. We prove 
this now, in part to introduce some methods of statistical analysis, and to emphasize the distinction between 
“uncorrelated” and “independent.” Understanding analysis methods enables you to analyze a new system 
reliably, so learning these methods is important for research. 


Two random variables are independent if they have no relationship at all. Mathematically, the definition 
of statistical independence of two random variables is that the joint density is simply the product of the 
individual densities: 


\[
\mathrm{pdf}_{x,y}(x,y) = \mathrm{pdf}_x(x)\,\mathrm{pdf}_y(y) \qquad \text{(statistical independence)} .
\]


The definition of uncorrelated is that the covariance, or equivalently the correlation coefficient, is zero: 


\[
\operatorname{cov}(x,y) \equiv \left\langle (x-\mu_x)(y-\mu_y) \right\rangle = 0 \qquad \text{(uncorrelated random variables)} . \tag{7.1}
\]


These definitions are all we need to prove that independent random variables are uncorrelated. First, we 
prove a slightly simpler claim: independent zero-mean random variables are uncorrelated: 


\[
\text{Given:} \qquad \mu_x = \int dx\; \mathrm{pdf}_x(x)\,x = 0, \qquad \mu_y = \int dy\; \mathrm{pdf}_y(y)\,y = 0,
\]


then the integral factors into x and y integrals, because the joint density of independent random variables 
factors: 


\[
\operatorname{cov}(x,y) = \langle xy \rangle = \iint dx\,dy\; \mathrm{pdf}_{x,y}(x,y)\,xy
 = \left( \int dx\; \mathrm{pdf}_x(x)\,x \right) \left( \int dy\; \mathrm{pdf}_y(y)\,y \right) = 0 .
\]


For non-zero-mean random variables, (x − μ_x) is a zero-mean random value, as is (y − μ_y). But these are 
the quantities that appear in the definition of covariance (7.1). Therefore, the covariance of any two 
independent random variables is zero. 


Note well: 


Independent random variables are necessarily uncorrelated, but the converse is not true: uncorrelated 
random variables may still be dependent. 


For example, if X ∈ uniform(−1, 1), and Y = X², then X and Y are uncorrelated, but highly dependent. 


r You Serious? 


Ask the average person on the street, “Is a correlation coefficient of 0.4 important?” You’re likely to 
get a response like, “Wow, 40%. That’s a lot.” In fact, it’s almost nothing. Racing through a quick 
calculation (that we explain more carefully below): ρ = 0.4 means the variance can be reduced by a fraction 
of ρ² = 0.16, to 0.84 of its original value. The standard deviation is then (0.84)^(1/2) ≈ 0.92 of its original value, 
for a decrease of only 8%! Pretty paltry from a correlation coefficient of ρ = 0.4. 


To see why, we first note that the standard deviation, σ, of a data set is a reasonable measure of its 
variation: σ has the same units as the measurements and the average, so it’s easy to compare with them. (In 
contrast, the variance, σ², is an important and useful measure of variation, but it cannot be directly compared 
to measurements or averages.) 


We now address the correlation coefficient, ρ (rather than r, which is an estimate of ρ). For definiteness, 
consider a set of measurements y_i, and their predictors (perhaps independent variables) x_i. ρ² tells us what 
fraction of the variance of y is accounted for by the x_i. In other words, if we subtract out the values of y that 
are predicted by the x_i, by what fraction is the variance reduced? 


\[
\rho^2 \equiv \frac{\operatorname{var}(y_i) - \operatorname{var}(y_i - y_{\mathrm{pred},i})}{\operatorname{var}(y_i)}
 = 1 - \frac{\operatorname{var}(y_i - y_{\mathrm{pred},i})}{\operatorname{var}(y_i)} .
\]


But the important point is by what fraction is σ reduced? Since σ = (variance)^(1/2), and writing σ_pred for the 
standard deviation that remains after subtracting the predictions: 


\[
1-\rho^2 = \left( \frac{\sigma_{\mathrm{pred}}}{\sigma} \right)^2, \qquad
\frac{\sigma_{\mathrm{pred}}}{\sigma} = \sqrt{1-\rho^2}, \qquad
\sigma\text{-reduction} \equiv 1 - \frac{\sigma_{\mathrm{pred}}}{\sigma} = 1 - \sqrt{1-\rho^2} .
\]


For even moderate values of ρ, the reduction in σ is small (Figure 7.6). In fact, it’s not until ρ ≈ 0.5, where 
the reduction in σ is about 13%, that the correlation starts to become important. Don’t sweat the small stuff. 


[Plot: fraction of σ remaining (vertical axis) vs. ρ from 0 to 1 (horizontal axis).] 


Figure 7.6 Fractional reduction in σ vs. ρ, for a predictor with correlation coefficient ρ. The curve 
is an arc of a circle. 
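To put numbers on this, here is a small Python sketch that tabulates the fraction of σ remaining, (1 − ρ²)^(1/2), for a few values of ρ. The function name `sigma_fraction_remaining` is ours, chosen for illustration.

```python
import math

def sigma_fraction_remaining(rho):
    """Fraction of the original standard deviation left after removing the
    part of y predicted by a predictor with correlation coefficient rho."""
    return math.sqrt(1.0 - rho * rho)

for rho in (0.1, 0.2, 0.4, 0.5, 0.7, 0.9):
    remaining = sigma_fraction_remaining(rho)
    print(f"rho = {rho:.1f}: sigma remaining = {remaining:.3f}, "
          f"reduction = {100 * (1 - remaining):.1f}%")
```

The table makes the point of the figure: the reduction in σ stays in single digits until ρ is well above 0.4.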


Statistical Analysis Algebra 


Statistical analysis relies on a number of basic properties of combining random variables (RVs), which 
define an algebra of statistics. This algebra of RV interaction relates to distributions, averages, variances, 
and other properties. Within this algebra, there is much confusion about which results apply universally, and 
which apply only conditionally: e.g., gaussian distributions, independent RVs, uncorrelated RVs, etc. We 
explicitly address all conditions here. We will use all of these methods later, especially when we derive the 
lesser-known results for uncertainty weighted data. 


We first derive three important rules of statistical algebra. For random variables x and y: 


1. ⟨x + y⟩ = ⟨x⟩ + ⟨y⟩, for arbitrary dependence between x and y. 


2. ⟨xy⟩ = ⟨x⟩⟨y⟩ + cov(x, y). 
3. var(x + y) = var(x) + var(y) + 2cov(x, y). 


[For other interesting results, see 
https://www.probabilitycourse.com/chapter5/5_3_1_covariance_correlation.php .] 
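The three rules are exact identities, so they can be checked on any finite population. The following Python sketch does so with exact rational arithmetic; the five (x, y) pairs are made up for illustration and are deliberately dependent, and the helper names `avg`, `cov`, and `var` are ours.

```python
from fractions import Fraction

# A small, made-up population of dependent (x, y) pairs, each equally likely.
pairs = [(1, 2), (2, 1), (3, 5), (4, 4), (6, 9)]
xs = [Fraction(x) for x, _ in pairs]
ys = [Fraction(y) for _, y in pairs]

def avg(values):
    return sum(values) / len(values)

def cov(a, b):
    ma, mb = avg(a), avg(b)
    return avg([(ai - ma) * (bi - mb) for ai, bi in zip(a, b)])

def var(a):
    return cov(a, a)

sums = [x + y for x, y in zip(xs, ys)]
prods = [x * y for x, y in zip(xs, ys)]

rule1 = avg(sums) == avg(xs) + avg(ys)                        # average of a sum
rule2 = avg(prods) == avg(xs) * avg(ys) + cov(xs, ys)         # average of a product
rule3 = var(sums) == var(xs) + var(ys) + 2 * cov(xs, ys)      # variance of a sum
print(rule1, rule2, rule3)
```

Because the arithmetic is exact, all three comparisons hold with equality rather than to within rounding, even though cov(x, y) is not zero for this population.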


The Average of a Sum: Easy? 


We all know that <x + y> = <x> + <y>. But is this true even if x and y are dependent random variables 
(RVs)? Let’s see. We can find <x + y> for dependent variables by integrating over the joint density: 


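\[
\begin{aligned}
\langle x+y \rangle &= \iint dx\,dy\; \mathrm{pdf}_{x,y}(x,y)\,(x+y)
 = \iint dx\,dy\; \mathrm{pdf}_{x,y}(x,y)\,x + \iint dx\,dy\; \mathrm{pdf}_{x,y}(x,y)\,y \\
&= \langle x \rangle + \langle y \rangle .
\end{aligned}
\]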


Therefore, the result is easy, and essential for all further analyses: 


The average of a sum equals the sum of averages, even for RVs of arbitrary dependence. 


The Average of a Product 


Life sure would be great if the average of a product were the product of the averages ... but it’s not, in 
general. Although, sometimes it is. As scientists, we need to know the difference. Given x and y are random 
variables (RVs), what is <xy>? 


In statistical analysis, it is often surprisingly useful to break up a random variable into its “varying” part 
plus its average; therefore, we define: 


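\[
x \equiv \delta x + \mu_x, \qquad y \equiv \delta y + \mu_y \qquad \Rightarrow \qquad \langle \delta x \rangle = \langle \delta y \rangle = 0 .
\]


Note that μ_x and μ_y are constants. Then we can evaluate: 


\[
\begin{aligned}
\langle xy \rangle &= \left\langle (\delta x + \mu_x)(\delta y + \mu_y) \right\rangle
 = \langle \delta x\,\delta y \rangle + \mu_x \langle \delta y \rangle + \mu_y \langle \delta x \rangle + \mu_x \mu_y \\
&= \mu_x \mu_y + \left\langle (x-\mu_x)(y-\mu_y) \right\rangle = \mu_x \mu_y + \operatorname{cov}(x,y) .
\end{aligned}
\]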


The average of the product is the product of the averages plus the covariance. 


Only if x and y are uncorrelated (which is implied if they are independent; see earlier) is the average of 
the product equal to the product of the averages. 


This rule provides a simple corollary: the average of an RV squared: 


\[
\langle x^2 \rangle = \mu_x^2 + \operatorname{cov}(x,x) = \mu_x^2 + \sigma_x^2 . \tag{7.2}
\]


Variance of a Sum 


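We frequently need the variance of a sum of possibly dependent RVs. We derive it here for RVs x, y: 


\[
\begin{aligned}
\operatorname{var}(x+y) &= \left\langle (x+y-\mu_x-\mu_y)^2 \right\rangle = \left\langle \left[ (x-\mu_x)+(y-\mu_y) \right]^2 \right\rangle \\
&= \left\langle (x-\mu_x)^2 \right\rangle + \left\langle (y-\mu_y)^2 \right\rangle + 2\left\langle (x-\mu_x)(y-\mu_y) \right\rangle
 = \operatorname{var}(x) + \operatorname{var}(y) + 2\operatorname{cov}(x,y) .
\end{aligned}
\]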


Covariance Revisited 


The covariance comes up so frequently in statistical analysis that it merits an understanding of its 
properties as part of the statistical algebra. Covariance appears directly in the formulas for the variance of a 
sum, and the average of a product, of RVs. (You might remember this by considering the units. For a sum 
x + y: [x] = [y], and [var(x + y)] = [x²] = [y²] = [cov(x, y)]. For a product xy: [xy] = [cov(x, y)].) Conceptually, 
the covariance of two RVs, a and b, measures how much a and b vary together linearly from their respective 
averages. If positive, it means a and b tend to go up together; if negative, it means a tends to go up when b 
goes down, and vice-versa. Covariance is defined as a population average: 


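\[
\operatorname{cov}(a,b) \equiv \left\langle (a-\mu_a)(b-\mu_b) \right\rangle .
\]


From the definition, we see that cov( ) is a bilinear, commutative operator. Given that a, b, c, d are random 
variables, and k is a constant: 


\[
\begin{aligned}
\operatorname{cov}(a,b) &= \operatorname{cov}(b,a) \\
\operatorname{cov}(ka,b) &= \operatorname{cov}(a,kb) = k\operatorname{cov}(a,b) \\
\operatorname{cov}(a+c,b) &= \operatorname{cov}(a,b)+\operatorname{cov}(c,b), \qquad
\operatorname{cov}(a,b+d) = \operatorname{cov}(a,b)+\operatorname{cov}(a,d) .
\end{aligned}
\]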


Occasionally, when expanding a covariance, there may be constants in the arguments. We can consider 
a constant as a random variable which always equals its average, so: 


\[
\begin{aligned}
\operatorname{cov}(a,k) &= 0 \\
\operatorname{cov}(a+k,b) &= \operatorname{cov}(a,b+k) = \operatorname{cov}(a,b) .
\end{aligned}
\]


From the definition, we find that the covariance of an RV with itself is the RV’s variance: 


\[
\operatorname{cov}(a,a) = \operatorname{var}(a) .
\]
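These covariance properties can be checked directly on a small population. The Python sketch below uses exact rational arithmetic; the sample values for a, b, c and the constant k are made up for illustration, and the helper names are ours.

```python
from fractions import Fraction

def avg(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = avg(a), avg(b)
    return avg([(x - ma) * (y - mb) for x, y in zip(a, b)])

# Made-up, equally likely joint outcomes for three RVs a, b, c.
a = [Fraction(v) for v in (1, 4, 2, 7, 3)]
b = [Fraction(v) for v in (2, 3, 9, 1, 5)]
c = [Fraction(v) for v in (5, 0, 1, 6, 2)]
k = Fraction(3)

symmetric   = cov(a, b) == cov(b, a)
scaling     = cov([k * x for x in a], b) == k * cov(a, b)
additive    = cov([x + z for x, z in zip(a, c)], b) == cov(a, b) + cov(c, b)
const_zero  = cov(a, [k] * len(a)) == 0            # cov with a constant
shift       = cov([x + k for x in a], b) == cov(a, b)
self_is_var = cov(a, a) == avg([x * x for x in a]) - avg(a) ** 2
print(symmetric, scaling, additive, const_zero, shift, self_is_var)
```

Each comparison is an exact equality, which is what bilinearity and the treatment of constants predict.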


Capabilities and Limits of the Sample Variance 


The following developments yield important results, and illustrate some methods of statistical algebra 
that are worth understanding. We wish to determine an unbiased estimator for the population variance, σ², 
from a sample (set) of n independent values {y_i}, in two cases: (1) we already know the population average 
μ; and (2) we don’t know the population average. The first case is easier. We proceed in detail, because we 
need this foundation of process to be rock solid, since so much is built upon it. 


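σ² from sample and known μ: We must start with the definition of population variance as an average 
over the population: 


\[
\sigma^2 \equiv \left\langle (y-\mu)^2 \right\rangle \quad \text{where} \quad
\mu \equiv \langle y \rangle \equiv \text{average over population of } y \equiv \lim_{N\to\infty} \frac{1}{N} \sum_{i=1}^{N} y_i . \tag{7.3}
\]


A simple guess for the estimator of σ², motivated by the definition, might be: 


\[
g^2 \equiv \frac{1}{n} \sum_{i=1}^{n} (y_i-\mu)^2 \qquad \text{(a guess)} .
\]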


We now analyze our guess over many samples of size n, to see how it performs. By definition, to be unbiased, 
the average of g² over an ensemble of samples of size n must equal σ²: 


\[
\text{unbiased:} \qquad \left\langle g^2 \right\rangle_{\text{ensemble}} = \sigma^2 \equiv \left\langle (y-\mu)^2 \right\rangle_{\text{population}} .
\]


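Mathematically, we find an ensemble average by letting the number of ensembles go to infinity, and the 
definition of population average is given by letting the number of individual values go to infinity. Let M be 
the number of ensembles. Then: 


\[
\left\langle g^2 \right\rangle_{\text{ensemble}} \equiv \lim_{M\to\infty} \frac{1}{M} \sum_{m=1}^{M} g_m^2
 = \lim_{M\to\infty} \frac{1}{M} \sum_{m=1}^{M} \left[ \frac{1}{n} \sum_{i=1}^{n} (y_i-\mu)^2 \right]_{\text{sample } m} .
\]


Since all the y_i above are distinct, we can combine the summations. Effectively, we have converted the 
ensemble average on the RHS to a population average, whose properties we know: 


\[
\left\langle g^2 \right\rangle_{\text{ensemble}} = \lim_{M\to\infty} \frac{1}{Mn} \sum_{m=1}^{Mn} (y_m-\mu)^2
 = \left\langle (y-\mu)^2 \right\rangle_{\text{population}} = \sigma^2 .
\]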


We have proved that our guess is an unbiased estimator of the population variance, σ². 


(In fact, since we already know that the sample average is an unbiased estimate of the population average, 
and the variance σ² is defined as a population average, then we can conclude immediately that the sample 
average of (y_i − μ)² is an unbiased estimate of the population average ⟨(y_i − μ)²⟩ = σ². Again, we took 
the long route above to illustrate important methods that we will use again.) 


Note that the denominator is n, and not n − 1, 
because we started with separate knowledge of the population average μ. 
For example, when figuring the standard deviation of grades in a class, one uses n in the denominator, since 
the class average is known exactly. 


σ² from sample alone: A harder case is estimating σ² when μ is not known. As before, we must start 
with a guess at an estimator, and then analyze our guess to see how it performs. A simple guess, motivated 
by the definition, might be: 


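\[
s^2 \propto \sum_{i=1}^{n} (y_i-\bar y)^2 \quad \text{(a guess)} \quad \text{where} \quad \bar y \equiv \frac{1}{n} \sum_{i=1}^{n} y_i .
\]


By definition, to be unbiased, the average of s² over an ensemble of samples of size n must equal σ². We 
now consider the sum in s². We first show a failed attempt, and then how to avoid it. If we try to analyze 
the sum directly, we get: 


\[
\left\langle \sum_{i=1}^{n} (y_i-\bar y)^2 \right\rangle
 = \left\langle \sum_{i=1}^{n} \left( y_i^2 - 2 y_i \bar y + \bar y^2 \right) \right\rangle
 = \sum_{i=1}^{n} \left\langle y_i^2 \right\rangle - 2 \sum_{i=1}^{n} \left\langle y_i \bar y \right\rangle + n \left\langle \bar y^2 \right\rangle .
\]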


In the equation above, angle brackets mean ensemble average. By tradition, we don’t explicitly label 
our angle brackets to say what we are averaging over, and we make you figure it out. Even better, as we saw 
earlier, sometimes the angle brackets mean ensemble average, and sometimes they mean population average. 
(This is a crucial difference in definition, and a common source of confusion in statistical analysis: just what 
are we averaging over, anyway?) On the RHS, the first ensemble average is the same as the 
population average. However, further analysis of the ensemble averages at this point is messy (more on this 
later). 


To avoid the mess, we note that definition (7.4) requires us to somehow introduce the population average 
into the analysis, even though it is unknown. By trial and error, we find it is easier to start with the population 
average, and write it in terms of ȳ: 


\[
\sum_{i=1}^{n} (y_i-\mu)^2 = \sum_{i=1}^{n} \left( (y_i-\bar y)+(\bar y-\mu) \right)^2
 = \sum_{i=1}^{n} (y_i-\bar y)^2 + \underbrace{2 \sum_{i=1}^{n} (\bar y-\mu)(y_i-\bar y)}_{0} + \sum_{i=1}^{n} (\bar y-\mu)^2 . \tag{7.5}
\]


In the second term, (ȳ − μ) does not depend on i, so it comes out of the summation. Then the second term 
is identically zero, because: 


\[
\sum_{i=1}^{n} (y_i-\bar y) = \sum_{i=1}^{n} y_i - n\bar y = n\bar y - n\bar y = 0 .
\]


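Now we can take the ensemble average of the remains of the sum-of-squares equation: 


\[
\begin{aligned}
\left\langle \sum_{i=1}^{n} (y_i-\mu)^2 \right\rangle &= \left\langle \sum_{i=1}^{n} (y_i-\bar y)^2 \right\rangle + \left\langle \sum_{i=1}^{n} (\bar y-\mu)^2 \right\rangle \\
\sum_{i=1}^{n} \left\langle (y_i-\mu)^2 \right\rangle &= \left\langle \sum_{i=1}^{n} (y_i-\bar y)^2 \right\rangle + n \left\langle (\bar y-\mu)^2 \right\rangle .
\end{aligned}
\]


All the ensemble averages in the sum on the LHS are the same, and equal the population average, which is 
the definition of σ². On the RHS, we use the known properties of ȳ: 


\[
\langle \bar y \rangle = \mu, \qquad \left\langle (\bar y-\mu)^2 \right\rangle = \operatorname{var}(\bar y) = \sigma^2 / n .
\]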


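Then we have: 


\[
n\sigma^2 = \left\langle \sum_{i=1}^{n} (y_i-\bar y)^2 \right\rangle + \sigma^2
\quad \Rightarrow \quad \left\langle \sum_{i=1}^{n} (y_i-\bar y)^2 \right\rangle = (n-1)\sigma^2
\quad \Rightarrow \quad s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i-\bar y)^2 .
\]


How to Do Statistical Analysis Wrong, and How to Fix It 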


The following example development contains one error that illustrates a common mistake in statistical 
analysis: failure to account for dependence between random values. We then show how to correct the error 
using our statistical algebra. This example re-analyzes an earlier goal: to determine an unbiased estimator 
for the population variance, σ², from a sample of n values {y_i}. 


As before, we start with a guess that our unbiased estimator of σ² is proportional to the sum squared 
deviation from the average (similar to the messy attempt we gave up on earlier). Since we know we must 
introduce μ into the computation, we choose to expand the sum by adding and subtracting μ: 


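\[
\sum_{i=1}^{n} (y_i-\bar y)^2 = \sum_{i=1}^{n} \left[ (y_i-\mu)+(\mu-\bar y) \right]^2
 = \sum_{i=1}^{n} \left[ (y_i-\mu)^2 + 2(y_i-\mu)(\mu-\bar y) + (\mu-\bar y)^2 \right] .
\]


\[
\left\langle \sum_{i=1}^{n} (y_i-\bar y)^2 \right\rangle
 = \sum_{i=1}^{n} \left\langle (y_i-\mu)^2 \right\rangle + 2 \sum_{i=1}^{n} \left\langle (y_i-\mu)(\mu-\bar y) \right\rangle + n \left\langle (\mu-\bar y)^2 \right\rangle . \tag{7.6}
\]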


All the ensemble averages on the RHS now equal their population averages. We consider each of the three 
terms in turn: 


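• $\left\langle (y_i-\mu)^2 \right\rangle_{\text{ensemble}} = \left\langle (y_i-\mu)^2 \right\rangle_{\text{population}} = \sigma^2$, and the summation in the first term on the right is 
n times this. 


• In the 2nd term on the RHS, the averages of both factors, (y_i − μ) and (μ − ȳ), are zero, so we drop 
that term. 


• $\left\langle (\mu-\bar y)^2 \right\rangle = \left\langle (\bar y-\mu)^2 \right\rangle = \operatorname{var}(\bar y) = \sigma^2 / n$. 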


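\[
\left\langle \sum_{i=1}^{n} (y_i-\bar y)^2 \right\rangle = n\sigma^2 + \sigma^2 = (n+1)\sigma^2
\quad \Rightarrow \quad s^2 = \frac{1}{n+1} \sum_{i=1}^{n} (y_i-\bar y)^2 \quad \text{(wrong!)} . \tag{7.7}
\]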


Clearly, this is wrong: the denominator should be (n — 1). What happened? See if you can figure it out before 
reading further. 


Really, stop reading now, and figure out what went wrong. Apply our statistical algebra. 


The error is in the second bullet above: just because two RVs both average to zero doesn’t mean their 
product averages to zero (see the average of a product, earlier). In fact, the average of the product must 
include their covariance. In this case, any given y_i correlates (positively) with ȳ because ȳ includes each 
y_i. Since the ȳ is negated in the 2nd factor, the final correlation is negative. Then for a given k, using the 
bilinearity of covariance (μ is constant): 


\[
\operatorname{cov}\left( (y_k-\mu), (\mu-\bar y) \right) = -\operatorname{cov}(y_k, \bar y) = -\operatorname{cov}\left( y_k, \frac{1}{n} \sum_{j=1}^{n} y_j \right) .
\]


By assumption, the y_i are independent samples of y, and therefore have zero covariance between them: 


\[
\operatorname{cov}(y_k, y_j) = 0, \quad k \ne j, \qquad \text{and} \qquad \operatorname{cov}(y_k, y_k) = \sigma^2 .
\]


Thus the only term in the summation over j that survives the covariance operation is when j = k: 


\[
\operatorname{cov}\left( (y_k-\mu), (\mu-\bar y) \right) = -\operatorname{cov}\left( y_k, \frac{1}{n} y_k \right) = -\frac{\sigma^2}{n} .
\]


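Therefore, equation (7.7) should include the summation term from (7.6) that we incorrectly dropped. The 
ensemble average of each term in that summation is the same, which we just computed, so the summation is n 
times (−σ²/n), and it enters (7.6) with a factor of 2: 


\[
\left\langle \sum_{i=1}^{n} (y_i-\bar y)^2 \right\rangle = n\sigma^2 - 2n\frac{\sigma^2}{n} + \sigma^2 = (n-1)\sigma^2
\quad \Rightarrow \quad s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i-\bar y)^2 \quad \text{(right!)} .
\]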


i=] 


Order is restored to the universe. 
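For the skeptical, the whole argument can be checked by brute force. The Python sketch below takes a tiny hypothetical population (the values 0, 1, 3, equally likely), enumerates every possible ordered sample of size n = 4, and computes the ensemble averages exactly with rational arithmetic. The population, the sample size, and the helper names are our assumptions for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population: the values 0, 1, 3, each equally likely.
population = [Fraction(v) for v in (0, 1, 3)]
mu = sum(population) / len(population)
sigma2 = sum((y - mu) ** 2 for y in population) / len(population)

n = 4  # sample size
# Every ordered sample of n independent draws is equally likely, so the
# ensemble average is a plain average over all len(population)**n samples.
samples = list(product(population, repeat=n))

def ensemble_avg(f):
    return sum(f(s) for s in samples) / len(samples)

def ybar(s):
    return sum(s) / len(s)

known_mu   = ensemble_avg(lambda s: sum((y - mu) ** 2 for y in s) / n)
divide_n   = ensemble_avg(lambda s: sum((y - ybar(s)) ** 2 for y in s) / n)
divide_n_1 = ensemble_avg(lambda s: sum((y - ybar(s)) ** 2 for y in s) / (n - 1))
cross_term = ensemble_avg(lambda s: (s[0] - mu) * (mu - ybar(s)))

print("sigma^2                  =", sigma2)
print("<sum (y-mu)^2 / n>       =", known_mu)
print("<sum (y-ybar)^2 / n>     =", divide_n)
print("<sum (y-ybar)^2 / (n-1)> =", divide_n_1)
print("<(y_1-mu)(mu-ybar)>      =", cross_term, " vs -sigma^2/n =", -sigma2 / n)
```

With μ known, dividing by n is unbiased; with ȳ in place of μ, dividing by n is biased low by the factor (n − 1)/n, and dividing by n − 1 restores σ². The last line shows the cross term that the wrong analysis dropped: it is −σ²/n, not zero.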


Introduction to Data Fitting (Curve Fitting) 


Suppose we have an ideal process, with an ideal curve y_ideal(x) mapping an independent variable x to a 
dependent variable y. Now we take a set of measurements of this process, that is, we measure a set of data 
pairs (x_i, y_i), Figure 7.7 left. 


Figure 7.7 (Left) Ideal curve with non-ideal data. (Right) The same data with a straight line fit. 


Suppose further we don’t know the ideal curve, but we have to guess it. Typically, we make a guess (a 
model) of the general form of the curve from theoretical or empirical information, but we leave the exact 
parameters of the curve “free.” For example, we may guess that the form of the curve is a straight line (Figure 
7.7 right): 


\[
y_{\mathrm{mod}}(x) = mx + b ,
\]


but we leave the slope and intercept (m and b) of the curve as-yet unknown. (We might guess other forms, 
with other, possibly more parameters.) Then we fit our curve to the data, which means we compute the 
values of m and b which “best” fit the data. “Best” means that the values of m and b minimize some measure 
of “error,” called the figure of merit, compared to all other values of m and b. For data with constant 
uncertainty, the most common figure of merit is the sum-squared residual: 


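\[
\begin{aligned}
\text{sum-squared-residual} \equiv SSE &\equiv \sum_{i=1}^{n} \text{residual}_i^{\,2} \\
&= \sum_{i=1}^{n} \left( \text{measurement}_i - \text{model}_i \right)^2 = \sum_{i=1}^{n} \left( \text{measurement}_i - y_{\mathrm{mod}}(x_i) \right)^2 ,
\end{aligned}
\]


where y_mod(x) is our fitting function. 
The (measurement − curve) is often written as (O − C) for (observed − computed). In our example of fitting 
to a straight line, for given values of m and b, we have: 


\[
SSE = \sum_{i=1}^{n} \text{residual}_i^{\,2} = \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right)^2 .
\]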


Curve fitting is the process of finding the values of all our unknown parameters 
such that they minimize some error function 
(for constant uncertainty, the sum-squared residual from our data). 


The purpose of fitting, in general, is to estimate parameters, some of which may not have simple, closed- 
form estimators. 


We discuss data with varying uncertainty later; in that more general case, we adjust parameters to 
minimize the χ² parameter. 


An important special case of curve fitting is “linear fitting” (aka “linear regression”), which we discuss 
in much more detail later. Linear regression is not necessarily fitting to a straight line. 
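For the straight-line model with constant uncertainty, the minimizing m and b have a well-known closed form, which makes a convenient first test of any fitting code. The Python sketch below is a minimal illustration; the data values are made up, and the function names `sse` and `fit_line` are ours.

```python
def sse(m, b, xs, ys):
    """Sum-squared residual of the straight-line model y_mod(x) = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

def fit_line(xs, ys):
    """Closed-form m and b that minimize sse() for equal uncertainties."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, ybar - m * xbar

# Made-up data scattered about y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

m, b = fit_line(xs, ys)
best = sse(m, b, xs, ys)
print(f"m = {m:.3f}, b = {b:.3f}, SSE = {best:.4f}")

# Nudging either parameter away from the fit can only increase the SSE.
print(sse(m + 0.01, b, xs, ys) >= best, sse(m, b - 0.01, xs, ys) >= best)
```

The last line is the definition of “best” in action: any other m or b gives a larger figure of merit.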


Goodness of Fit 


Chi-Squared Distribution 


Notational problem: in the literature, χ²_ν sometimes means χ² with ν degrees of freedom, and sometimes 
means reduced χ² with ν degrees of freedom. I’m inclined to use χ²(ν; x) for the full χ² distribution, and 
χ²_red(ν; x) to denote explicitly the reduced χ² distribution, and χ² for the statistic. 


You don’t really need to understand the χ² distribution to understand the χ² statistic, but we start there 
because it’s helpful background. 


Notation: X ∈ D(x) means X is a random variable with probability distribution function (PDF) = D(x). 


X ∈ kD(x) means X is an RV which is k (a constant) times an RV which is ∈ D. 

Chi-squared (χ²) distributions are a family of distributions characterized by one parameter, called ν 
(Greek nu), the “degrees of freedom.” (Contrast with the gaussian distribution, which has two parameters, 
the mean, μ, and standard deviation, σ.) So we say “chi-squared is a 1-parameter distribution.” ν is almost 
always an integer. The simplest case is ν = 1: if we define a new random variable X from a gaussian random 
variable χ, as: 


\[
X \equiv \chi^2, \quad \text{where } \chi \in \text{gaussian}(\mu = 0, \sigma^2 = 1), \text{ i.e. avg} = 0, \text{ variance} = 1,
\]


then X has a χ²(1) distribution. I.e., χ²(ν=1; x) is the probability distribution function of the square of a zero- 
mean unit-variance gaussian. 


For general ν, χ²(ν; x) is the PDF of the sum of the squares of ν independent gaussian random variables: 


\[
Y \equiv \sum_{i=1}^{\nu} \chi_i^2, \quad \text{where } \chi_i \in \text{gaussian}(\mu = 0, \sigma^2 = 1), \text{ i.e. avg} = 0, \text{ std deviation} = 1 .
\]


Thus, the random variable Y above has a χ²(ν) distribution. Chi-squared random variables are 
always ≥ 0, since they are the sums of squares of gaussian random variables. Since the gaussian distribution 
is continuous, the chi-squared distributions are also continuous. 


Figure 7.8 PDF (left) and CDF (right) of some χ²(k) distributions. χ²(1; 0) → ∞. χ²(2; 0) = ½. 
[http://en.wikipedia.org/wiki/Chi-squared_distribution] 


From the definition, we can also see that the sum of two independent chi-squared random variables is another 
chi-squared random variable: 


\[
\text{Let } A \in \chi^2(n), \; B \in \chi^2(m), \quad \text{then } A + B \in \chi^2(n+m) .
\]


By the central limit theorem, this means that for large ν, chi-squared itself approaches gaussian. However, a 
χ² random variable (RV) is always positive, whereas any gaussian PDF extends to negative infinity. 


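We can show that: 


\[
\left\langle \chi^2(1) \right\rangle = 1, \quad \operatorname{var}\left( \chi^2(1) \right) = 2
\quad \Rightarrow \quad \left\langle \chi^2(\nu) \right\rangle = \nu, \quad \operatorname{var}\left( \chi^2(\nu) \right) = 2\nu, \quad \operatorname{dev}\left( \chi^2(\nu) \right) = \sqrt{2\nu} .
\]


We don’t usually need the analytic form, but for completeness [WMMY 2007 p200b]: 


\[
\text{PDF:} \qquad \chi^2(\nu; x) = \frac{x^{\nu/2-1}\, e^{-x/2}}{\Gamma(\nu/2)\, 2^{\nu/2}} .
\]


For ν ≥ 3, there is a maximum at x = ν − 2. 


For ν = 1 or 2, there is no maximum, and the PDF is monotonically decreasing. 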
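As a sanity check on the analytic form, the Python sketch below evaluates the PDF for ν = 5 and confirms numerically that it integrates to 1, has average ν and variance 2ν, and peaks at ν − 2. The crude midpoint integrator, the integration range [0, 100], and the choice ν = 5 are our assumptions for illustration.

```python
import math

def chi2_pdf(nu, x):
    """PDF of the chi-squared distribution with nu degrees of freedom."""
    return x ** (nu / 2 - 1) * math.exp(-x / 2) / (math.gamma(nu / 2) * 2 ** (nu / 2))

def integrate(f, a, b, steps=100_000):
    """Midpoint-rule integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

nu = 5
norm = integrate(lambda x: chi2_pdf(nu, x), 0.0, 100.0)
mean = integrate(lambda x: x * chi2_pdf(nu, x), 0.0, 100.0)
var = integrate(lambda x: (x - nu) ** 2 * chi2_pdf(nu, x), 0.0, 100.0)

# Locate the peak on a fine grid; it should sit near nu - 2.
grid = [0.001 * i for i in range(1, 20_000)]
peak = max(grid, key=lambda x: chi2_pdf(nu, x))
print(f"norm = {norm:.4f}, mean = {mean:.4f}, var = {var:.4f}, peak at x = {peak:.3f}")
```

Changing `nu` to other values of 3 or more should move the average, variance, and peak accordingly.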


Chi-Squared Statistic 


As seen above, χ² is a continuous probability distribution. However, there is a goodness-of-fit test which 
computes a statistic also called “chi-squared.” This statistic is from a distribution that is often close to a χ² 
distribution, but be careful to distinguish between the statistic χ² and the distribution χ². 


The chi-squared statistic is not required to be from a chi-squared distribution, though it often is. All the 
chi-squared statistic really requires is that the variances of our residuals add, which is to say that our residuals 
are uncorrelated (not necessarily independent, though independent implies uncorrelated). 


However, for large ν, the χ² distribution approaches gaussian, as does the sum of many values of any 
distribution. Therefore: 


To illustrate, consider a set of measurements, each with uncertainty u. Then if the set of {(measurement 
− model)/u} has zero mean, it has standard-deviation = 1, even for non-gaussian residuals: 


Define: dev(X) ≡ standard deviation of random variable X, also written σ_X, 


var(X) ≡ (dev(X))² = variance of random variable X, also written σ_X². 


\[
\operatorname{dev}\left( \frac{\text{residual}}{u} \right) = 1 \quad \Rightarrow \quad \operatorname{var}\left( \frac{\text{residual}}{u} \right) = 1 .
\]


As a special case, but not required for a χ² statistic, if our residuals are gaussian: 


\[
\frac{\text{residual}}{u} \in \text{gaussian}(0, 1) \quad \Rightarrow \quad \left( \frac{\text{residual}}{u} \right)^2 \in \chi^2(1) .
\]


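When fitting a curve to data, the uncertainties often vary from measurement to measurement. In that 
case, we are fitting a curve to data triples: (x_i, y_i, u_i). Still, the error divided by uncertainty for any single 
measurement is unit deviation: 


\[
\operatorname{dev}\left( \frac{\text{residual}_i}{u_i} \right) = 1, \quad \text{and} \quad \operatorname{var}\left( \frac{\text{residual}_i}{u_i} \right) = 1, \quad \text{for all } i .
\]


If we have n measurements, with uncorrelated residuals, then because variances add: 


\[
\operatorname{var}\left( \sum_{i=1}^{n} \frac{\text{residual}_i}{u_i} \right) = n . \qquad
\text{For gaussian errors:} \quad \sum_{i=1}^{n} \left( \frac{\text{residual}_i}{u_i} \right)^2 \in \chi^2(n) .
\]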


Returning to our ideal process from Figure 7.7, with an ideal curve mapping an independent variable x 
to a dependent variable y, we now take a set of measurements y_i with known uncertainties u_i. 


Then our dimensionless statistic χ² for the ideal curve is defined as: 


\[
\chi^2 \equiv \sum_{i=1}^{n} \left( \frac{\text{residual}_i}{u_i} \right)^2 = \sum_{i=1}^{n} \left( \frac{\text{measurement}_i - \text{ideal}_i}{u_i} \right)^2 .
\]


If the residuals are gaussian and the u_i are accurate, then χ² ∈ χ²(n). 


If n is large, this sum will be close to n times the variance (= 1). For zero-mean errors: 


\[
\left\langle \chi^2 \right\rangle = \sum_{i=1}^{n} \operatorname{var}\left( \frac{\text{residual}_i}{u_i} \right) = n .
\]


Now suppose we have fit a curve to our data, i.e. we guessed a functional form, and found the parameters 
which minimize the χ² statistic for that form with our data. If our fit is good, then our curve is very close to 
the ideal (or “real”) dependence curve for y as a function of x, and our errors will be essentially random (no 
systematic error). We now compute the χ² statistic for our fit: 


\[
\chi^2 \equiv \sum_{i=1}^{n} \left( \frac{\text{residual}_i}{u_i} \right)^2 = \sum_{i=1}^{n} \left( \frac{\text{measurement}_i - \text{fit}_i}{u_i} \right)^2 .
\]


If our fit is good, the number χ² will likely be close to n. (We will soon modify the distribution of the χ² 
statistic, but for now, it illustrates our principle.) 


If our fit is bad, there will be significant systematic fit error in addition to our random noise error, and 
our χ² statistic will be much larger than n. Summarizing: 


If χ² is close to n, then our fit residuals are no worse than our measurement 
uncertainties, and the fit is “good.” If χ² is much larger than n, then our 
fit residuals are worse than our measurement uncertainties, so our fit must be “bad.” 


Degrees of freedom: So far we have ignored the “degrees of freedom” of the fit, which we now 
motivate. (We prove this in detail later.) Consider again a hypothetical fit to a straight line. We are free to 
choose our parameters m and b to define our “fit-line.” But in a set of n data points, we could (if we wanted) 
choose our m and b to exactly go through two of the data points. 


This guarantees that two of our fit residuals are zero. If n is large, it won’t significantly affect the other 
residuals, and instead of χ² being the sum of n squared-residuals, it is approximately the sum of (n − 2) 
squared-residuals. In this case, ⟨χ²⟩ ≈ n − 2. A rigorous analysis (given later) shows that for the best fit 
line (which probably doesn’t go through any of the data points), and gaussian residuals, then ⟨χ²⟩ = n − 2, 
exactly. This concept generalizes quite far: 

• Even if we don’t fit 2 points exactly to the line; 

• Even if our fit-curve is not a line; 

• Even if we have more than 2 fit parameters; 


the effect is to reduce the χ² statistic to be a sum of fewer than n squared-residuals. The effective number of 
squared-residuals in the χ² sum is called the degrees of freedom (dof), and is given by: 


\[
\text{dof} \equiv n - (\text{\# fit parameters}) .
\]


Thus for gaussian residuals, and p linear fit parameters, the parameters of our χ² statistic are really: 


\[
\left\langle \chi^2 \right\rangle = \text{dof} = n - p, \qquad \operatorname{dev}\left( \chi^2 \right) = \sqrt{2(\text{dof})} = \sqrt{2(n-p)} . \tag{7.8}
\]


For nonlinear fits, we use the same formula as an approximation. 


Reduced Chi-Squared Parameter 


Since it is awkward for everyone to know n, the number of points in our fit, it is convenient to define a 
“goodness-of-fit” parameter that is less dependent on n. We simply divide our chi-squared statistic by dof, 
to get the reduced chi-squared statistic. Then it has these properties: 


\[
\begin{aligned}
\text{reduced } \chi^2 &\equiv \frac{\chi^2}{\text{dof}} = \frac{1}{\text{dof}} \sum_{i=1}^{n} \left( \frac{\text{measurement}_i - \text{fit}_i}{u_i} \right)^2 \\
\left\langle \text{reduced } \chi^2 \right\rangle &= \frac{\left\langle \chi^2 \right\rangle}{\text{dof}} = \frac{\text{dof}}{\text{dof}} = 1 \\
\operatorname{dev}\left( \text{reduced } \chi^2 \right) &= \frac{\operatorname{dev}\left( \chi^2 \right)}{\text{dof}} = \frac{\sqrt{2(\text{dof})}}{\text{dof}} = \sqrt{\frac{2}{\text{dof}}} .
\end{aligned}
\]


If reduced χ² is close to 1, the fit is “good.” If reduced χ² is much larger than 1, the fit is “bad.” By “much 
larger” we mean several deviations away from 1, and the deviation gets smaller with larger dof (larger n). 


Of course, our confidence in χ² or reduced-χ² depends on how many data points went into computing it, 
and our confidence in our measurement uncertainties, u_i. Remarkably, one reference on χ² [which I don’t 
remember] says that our estimates of measurement uncertainties, u_i, should come from a sample of at least 
five! That seems to me to be quite small to have much confidence in u_i. 


The reduced χ² statistic is a convenient measure of the “misfit + noise” to “noise:” 


\[
\text{reduced } \chi^2 = \frac{1}{\nu} \sum_{i=1}^{n} \left( \frac{\text{measurement}_i - \text{fit}_i}{u_i} \right)^2 \sim \frac{\text{misfit} + \text{noise}}{\text{noise}}, \qquad \text{where } \nu \equiv \text{degrees of freedom} .
\]


Each term of χ² is normalized to the noise of that term. If your fit is perfect, reduced χ² will be around 1. If 
you have misfit and noise, then reduced χ² is greater than 1. 
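In code, the reduced χ² takes only a few lines. The Python sketch below computes it for made-up measurements with per-point uncertainties, against a hypothetical straight-line fit y = 2x + 1 (we simply assume its two parameters were fit to these data; no actual fitting is done here). The function names are ours.

```python
import math

def chi2_statistic(ys, fits, us):
    """Sum of squared residuals, each normalized by its uncertainty u_i."""
    return sum(((y - f) / u) ** 2 for y, f, u in zip(ys, fits, us))

def reduced_chi2(ys, fits, us, n_params):
    dof = len(ys) - n_params            # dof = n - (# fit parameters)
    return chi2_statistic(ys, fits, us) / dof, dof

# Made-up measurements and uncertainties; the fit line is assumed, not computed.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 2.7, 5.1, 7.3, 8.8, 11.1]
us = [0.2, 0.3, 0.2, 0.3, 0.2, 0.3]
fits = [2.0 * x + 1.0 for x in xs]

red, dof = reduced_chi2(ys, fits, us, n_params=2)
dev = math.sqrt(2.0 / dof)              # expected deviation of reduced chi-squared
print(f"dof = {dof}, reduced chi^2 = {red:.2f}, expected 1 +/- {dev:.2f}")
```

With only 4 degrees of freedom, the expected deviation of reduced χ² is about 0.7, so a value near 1 says little more than “not obviously bad”; the test sharpens as dof grows.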


Guidance Counselor: Computer Code to Fit Data 


Finding model parameters from data is called fitting the model to data. The “best” parameters are chosen 
by minimizing some figure-of-merit function. For example, this function might be the sum-squared error 
(between the data and the model), or the χ² fit parameter. Generic fitting (or “optimization”) algorithms are 
available off-the-shelf, e.g. [Numerical Recipes]. However, they are sometimes simplistic, and in the real 
world, often fail with arithmetic faults (overflow, underflow, domain-error, etc). The fault (no pun intended) 
lies not in their algorithm, but in their failure to tell you what you need to do to avoid such failures: 


Your job is to write a bullet-proof figure-of-merit function. 


This is harder than it sounds, but quite do-able with proper care. 


As an example, I once wrote code to fit a 3-parameter sinusoid (frequency, amplitude, phase) to 
astronomical data: measures of a star’s brightness at irregular times. That seems pretty simple, yet it was 
fraught with problems. The measurements were very noisy, which leads to lots of local minima. In some 
cases, the optimizer would choose an amplitude for the sinusoid that had a higher sum-of-squares than the 
sum-of-squares of the data! This amplitude is clearly “too big,” but it is hard to know ahead of time how big 
is “too big.” Furthermore, the “too big” threshold varies with the frequency and phase parameters, so you 
cannot specify ahead of time an absolute “valid range” for amplitude. Therefore, I had to provide “guiding 
errors” in my figure-of-merit function to “guide” the optimizer to a reasonable fit under all conditions. 


Computer code for finding the best-fit parameters is usually divided into two pieces, one piece you buy, 
and one piece you have to write yourself: 


• You buy a generic optimization algorithm, which varies parameters without knowledge of what they 
mean, looking for the minimum figure-of-merit (FOM). For each trial set of parameters, it calls 
your FOM function to compute the FOM as a function of the current trial parameters. 

• You write the FOM function which computes the FOM as a function of the given parameters. 


Generic optimizers usually minimize the figure-of-merit, consistent with the FOM being a “cost” or “error” 
that we want reduced. (If instead, you want to maximize a FOM, return its negative to the minimizer.) 


If your optimizer allows you to specify valid ranges for parameters, and if your fit parameters have valid 
ranges that are independent of each other, then you don’t need the methods here for your FOM function. If 
your optimizer (like many) does not allow you to limit the range of parameters, or if your parameters have 
valid ranges that depend on each other, then you need the following methods to make a bullet-proof FOM. 
In either case, this section illustrates how many seemingly simple calculations can go wrong in unexpected 
ways. 


A bullet-proof FOM function requires only two things: 

• Proper validation of all parameters. 

• A properly “bad” FOM for invalid parameters (a “guiding error”). 

Guiding errors are similar to penalty functions, but they operate outside the valid parameter space, rather than 
inside it. 

Parameter interdependence: Here is a simple example of a model where independent limits don’t 
work; the parameters are necessarily interdependent. Consider a parametrized function of time f(t; a, b), 
where t is the argument, and a and b are the two parameters to be varied for optimization, with t, a, b ∈ [0, 1]. 
Perhaps they represent the times of a first event at t = a, and a second event at t = b. The function f requires 
that a and b are limited to the range [0, 1], but furthermore, a must be < b. The valid values of a and b are 
shown in Figure 7.9. 


Generic optimizers know nothing about your figure-of-merit (FOM) function, or its behavior, 


and your FOM usually knows nothing about the optimizer, or its algorithms. 
No standard optimizer allows for interdependence of the valid ranges of parameters. 


The example below shows only a single parameter, and its guiding error values, but the concept is readily 
applied to the 2D example here, or any interdependence of parameters to be optimized. We will return to 
this 2D example after the 1D one below. 


Figure 7.9 (a) Valid region of parameters a and b. 


A simple 1D example: Suppose you wish to numerically find the minimum of the 1-parameter figure- 
of-merit function below left. Suppose the physics is such that only p > 1 is sensible. 


Figure 7.10 (a and b) Bad figure-of-merit (FOM) functions. (c) A bullet-proof FOM.


Your optimization-search algorithm will try various values of p, evaluating f(p) at each step, looking for
the minimum. You might write your FOM function like this: 
    fom(p) = 1./p + sqrt(p)


But the search function knows nothing of p, or which values of p are valid. It may well try p = −1. Then
your function crashes with a domain error in the sqrt() function. You fix it with (Figure 7.10b):

    float fom(p)
        if(p < 0.) return 4.
        return 1./p + sqrt(p)


Since you know 4 is much greater than the true minimum, you hope this will fix the problem. You run the 
code again, and now it crashes with a divide-by-zero error, because the optimizer tried p = 0. Easy fix:

    float fom(p)
        if(p <= 0.) return 4.
        return 1./p + sqrt(p)


Now the optimizer crashes with an overflow error, p < −(max_float). The big flat region to the left
confuses the optimizer. It searches negatively for a value of p that makes the FOM increase, but it never
finds one, and gets an overflow trying. Your flat value for p < 0 is no good. It needs to grow upward to the
left to provide guidance to the optimizer:

    float fom(p)
        if(p <= 0.) return 4. + fabs(p - 1.)    // fabs() = absolute value
        return 1./p + sqrt(p)


Now the optimizer says the minimum is about 5 at p = −10^−9. It found the local “minimum” just to the left of
zero. Your function is still ill-behaved. Since only p > 1 is sensible, you make yet another fix (Figure 7.10c):
    float fom(p)
        if(p <= 1.) return 4. + fabs(p - 1.)
        return 1./p + sqrt(p)


Finally, the optimizer returns the minimum FOM of 1.89 at p = 1.59. After 5 tries, you have made your FOM
function bullet-proof.


In this example, the FOM is naturally bullet-proof from the right. However, if it weren’t, you would 
include the absolute value of (p − 1) on the error return value, to provide a V-shape which guides the
optimizer into the valid range from either side. Such “guiding errors” are analogous to so-called penalty 
functions, but better, because they take effect only for invalid parameter choices, thus leaving the valid 
parameter space completely free for full optimization. 
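The final version of the 1D example can be tried end to end. The following is a minimal Python sketch, not the author's code: the ternary search is only a stand-in for a generic optimizer (an assumption made here for illustration), and the search interval [−100, 100] is arbitrary. The FOM itself follows the last pseudocode version above.

```python
import math

BIG = 4.0  # "big #": known to exceed the true minimum FOM (about 1.89)

def fom(p):
    """Bullet-proof FOM: the model is valid only for p > 1."""
    if p <= 1.0:
        # Guiding error: large, and sloping down toward the valid region.
        return BIG + abs(p - 1.0)
    return 1.0 / p + math.sqrt(p)

def ternary_search(f, lo, hi, iterations=200):
    """Toy minimizer for a unimodal f on [lo, hi]; it knows nothing about f."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2          # minimum lies in [lo, m2]
        else:
            lo = m1          # minimum lies in [m1, hi]
    return 0.5 * (lo + hi)

# The search starts from a range that is mostly invalid parameter space.
p_best = ternary_search(fom, -100.0, 100.0)
print(p_best, fom(p_best))
```

The search never crashes on invalid p, and the slope of the guiding error steers it into the valid region, where it settles near p = 2^(2/3).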


Multi-parameter FOMs: Most fit models use several parameters, p_i, and the optimizer searches over
all of them iteratively to find a minimum. Your FOM function must be bullet-proof over all parameters: it
must check each parameter for validity, and must return a large (guaranteed non-optimal) result for invalid
inputs. It must also slope the FOM toward valid values, i.e. provide a “restoring force” to the invalid
parameters toward the valid region. Typically, with multiple parameters p_i, one uses:


$$ \text{guiding\_bad\_FOM} = \text{big\#} + \sum_{i=1}^{M} \left| p_i - \text{valid}_i \right| , \qquad \text{where } \text{valid}_i \equiv \text{a valid value for } p_i . $$


This guides the minimization search when any parameter is outside its valid range.


Figure 7.11 “Guiding errors” lead naturally to a valid solution, and are better than traditional
penalty functions.


We return now to the multi-parameter function of Figure 7.9, where the valid ranges of a and b are
interdependent. We can say that b is allowed to be anywhere in [0, 1], and valid_a is (say) b/2. Then our
“guiding bad FOM” is:


$$ \text{guiding\_bad\_FOM} = \text{big\#} + \left| b - 0.5 \right| + \left| a - b/2 \right| . $$


Alternatively, we could just as well say that a ∈ [0, 1], and b must be between a and 1: valid_b = (1 + a)/2.
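As a rough illustration of the interdependent case, here is a short Python sketch. The function `model_fom` is a hypothetical placeholder for whatever the real FOM is inside the valid region, and the value of `BIG` is an assumption; only the validity test and the guiding error follow the text.

```python
BIG = 1.0e9  # assumed "big #"; it must exceed any FOM the valid region can produce

def model_fom(a, b):
    """Hypothetical stand-in for the real FOM; meaningful only for 0 <= a < b <= 1."""
    return (a - 0.25) ** 2 + (b - 0.75) ** 2

def fom(a, b):
    """Bullet-proof FOM for interdependent parameters: 0 <= a < b <= 1."""
    valid = (0.0 <= b <= 1.0) and (0.0 <= a < b)
    if valid:
        return model_fom(a, b)
    # Guiding error: valid_b = 0.5, and valid_a = b/2 depends on b.
    return BIG + abs(b - 0.5) + abs(a - b / 2.0)

print(fom(0.25, 0.75))   # valid point: ordinary FOM
print(fom(0.8, 0.2))     # invalid (a > b): BIG plus a slope toward validity
```

Note that the validity check covers the joint condition a < b, which independent per-parameter limits in an optimizer cannot express.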


A final note: 


The “big #” for invalid parameters may need to be much bigger than you think. 


In my dissertation research, I used reduced χ² as my FOM, and the true minimum FOM is near 1. I started
with 1,000,000 as my “big #”, but it wasn’t big enough! I was fitting to histograms with nearly a thousand 
counts in several bins. When the trial model bin count was small, the error was about 1,000, and the sum- 
squared-error over several bins was > 1,000,000. This caused the optimizer to settle on an invalid set of 
parameter values as the minimum! I had to raise “big #” to 10^9.
8 Multiple Linear Regression


Review of Multiple Linear Regression 


Most intermediate statistics texts cover multiple linear regression, e.g. [W&M p353], but we remind you
of some basic concepts here. Linear regression is also called “linear fitting.” Linear regression is a special
case of curve fitting where the model is a linear combination of basis functions, and the fit parameters are the
coefficients. A simple example is:


$$ y_{mod}(x) = b_0 + b_1 f_1(x) + b_2 f_2(x) + \dots + b_{p-1} f_{p-1}(x) = b_0 + \sum_{k=1}^{p-1} b_k f_k(x) . \qquad (8.1) $$


For your data, you measure n observables y_i vs. an independent variable x_i, i.e. you measure y_i = y(x_i) for some
set of x = {x_i}, and each y_i has an uncertainty u_i. You use multiple linear regression to find the best-fit
coefficients b_k of the basis functions f_k which compose the model, y_mod(x). The basis functions need not be
orthonormal. The independent variable might be position, time, or anything else. Note that:


A linear fit does not require that y is a straight-line function of x.


Fitting data to a line is often called “fitting data to a line” (seriously). (Unfortunately, some references use 
the term “linear model” to mean a straight-line model; this different use of the term “linear” adds confusion.) 
We now show that there is no mathematical difference between fitting to a line and linear fitting to an
arbitrary function. 


The quirky part is understanding what are the “predictors” (which may be random variables) to which 
we perform the regression. As above, the predictors can be arbitrary functions of a single independent 
variable, but they may also be arbitrary functions of multiple independent variables, called multiple linear 
regression. For example, a model for the speed of light in air varies with 3 independent variables: 
temperature, pressure, and humidity: 


$$ c = c(T, P, H) . $$


Suppose we take n measurements of c at various combinations of T, P, and H. Then our data consists of
quintuples: (T_i, P_i, H_i, c_i, u_i), i = 1, ... n, where u_i is the uncertainty in c_i. We might propose a linear model
with p = 5 parameters:


$$ c(T, P, H) = b_0 + b_1 T + b_2 P + b_3 H + b_4 T P . $$


The model is linear because it is a linear combination of arbitrary functions of T, P, and H. The last term 
above handles an interaction between temperature and pressure. Our linear regression model has 3 
independent variables, and 4 fit functions: T, P, H, and TP (the product of T and P). Written in terms of 
(8.1): 


$$ f_1(T,P,H) = T, \qquad f_2(T,P,H) = P, \qquad f_3(T,P,H) = H, \qquad f_4(T,P,H) = TP . $$
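To make the role of the predictors concrete, here is a small Python sketch of this 5-parameter model. The function names and the coefficient values in the usage line are made up for illustration; they are not measured values.

```python
def predictors(T, P, H):
    """Predictor values for one measurement; the leading 1 multiplies b_0."""
    return [1.0, T, P, H, T * P]

def model(b, T, P, H):
    """The model is a linear combination of the predictors with coefficients b."""
    return sum(bk * fk for bk, fk in zip(b, predictors(T, P, H)))

# Illustrative (made-up) coefficients b_0 ... b_4 and one (T, P, H) point.
b_example = [1.0, 2.0, 3.0, 4.0, 5.0]
print(predictors(2.0, 3.0, 4.0))
print(model(b_example, 2.0, 3.0, 4.0))
```

The model is nonlinear in the independent variables (through the product TP), yet linear in the coefficients, which is all that linear regression requires.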


For simplicity, this chapter illustrates only one independent variable, but all the results work 
straightforwardly with multiple independent variables, as above. This is because: 


It’s the predictors that matter, not the independent variables that go into them. 


[Units of [b] and [x], predictors in Xm; form.??] 


We Fit to the Predictors, Not the Independent Variable 


In general, we fit to multiple functions of the independent variables, usually not directly to the
independent variables themselves. Figure 8.1 shows an example fit to a model, with t as the lone independent
variable. The example model is f_1(t) = sin(ωt) for some given, fixed ω. Then the predictors f_1i are:


$$ y_{mod}(t) = b_1 f_1(t) = b_1 \sin(\omega t) \quad\Rightarrow\quad f_{1i} = \sin(\omega t_i) \quad \text{(predictors)} . $$


There is only 1 fit-function in this example; the predictors are the set of {f_1i}, and given them, there is no
longer any reference to the independent variable t. The fit is to the predictors, not to the independent values
t_i. Furthermore, the fit knows nothing of the functions f_k, it only knows their values f_ki at the measurement
points t_i. In some cases, there is no independent variable; there are only predictors.
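A short Python sketch may help show that the fit sees only the predictors. The frequency, sample times, and amplitude below are made-up values, and the data are noiseless for simplicity. The one-parameter least-squares formula b_1 = Σ f_1i y_i / Σ f_1i² used here is the p = 1 case of the standard equations given later in this chapter.

```python
import math

omega = 2.0                        # assumed, fixed angular frequency
t = [0.0, 0.3, 0.9, 1.4, 2.2]      # made-up sample times (not evenly spaced)
b_true = 1.5                       # made-up amplitude
y = [b_true * math.sin(omega * ti) for ti in t]   # noiseless made-up data

# Predictors: values of the fit function at the sample points.
f1 = [math.sin(omega * ti) for ti in t]

# From here on, t is never used: the fit needs only the predictors and the data.
b1 = sum(f * yi for f, yi in zip(f1, y)) / sum(f * f for f in f1)
print(b1)
```

Any other set of times that produced the same predictor values would give exactly the same fit.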


The existence of a “good” fit of a model to measurements does not imply any sort of causation 
by the predictors on the measurement. 


For example, the hands of a clock predict very well the brightness of the sky. But the hands do not cause the 
sky brightness, and changing their position will not change the sky brightness. 


Figure 8.1 (a) Example predictor: an arbitrary function of independent variable t. (b) Linear fit to
the predictor is a straight line. The fit is not to t itself. Even if the t_i are evenly spaced, the predictors
are not. Note that the predictor values of −0.5 and +0.5 each occur 3 times. This shows a good fit:
the measured values (green) are close to the model values.


We provide a brief overview of linear fitting here, and introduce our notation. A linear fit uses a set of
p coefficients, b_1, ... b_p, as fit parameters in a model with arbitrary fit functions. The “model” fit may be
defined as:


$$ y_{mod}(x) = b_1 f_1(x) + b_2 f_2(x) + \dots + b_p f_p(x) = \sum_{k=1}^{p} b_k f_k(x) \qquad (\text{without } b_0) . $$


Note that a linear fit does not require that y is a straight-line function of x. 


There is an important reason (discussed later) to include a fit-function that is just f_0(x) = 1, which gives
a coefficient b_0 that is the constant offset beyond all the other fit-functions. In this case, there are p − 1
remaining fit functions, since p is always the total number of fit parameters:


$$ y_{mod}(x) = b_0 + \sum_{k=1}^{p-1} b_k f_k(x) . $$


This is just a different notation than the first model given above. Therefore, the first form is completely 
general, and includes the second. However: 


Anything true of the first form is also true of the second, 
but the second form has important properties beyond those of the first form. 


We use both forms, depending on whether our model includes b_0 or not.
Summarizing: 


1. Multiple linear regression predicts the values of some random variable y from p (possibly correlated)
sets of predictors, f_ki, k = 1, 2, ... p, by fitting for the coefficients b_k. The predictors may or may not
be random variables.


2. In some cases, the predictors are arbitrary functions of a single independent variable, say t_i: f_ki =
f_k(t_i). We assume that all the t_i, y_i, and all the f_k are given, which means all the f_ki = f_k(t_i) are given.
In other cases, there are multiple independent variables, and multiple functions of those variables.


3. It’s linear prediction, so our prediction model is that y is a linear combination of the predictors,
{f_ki}:


$$ y_{mod} = b_0 + b_1 f_1 + b_2 f_2 + \dots + b_{p-1} f_{p-1} = b_0 + \sum_{k=1}^{p-1} b_k f_k \qquad \text{or} \qquad \sum_{k=1}^{p} b_k f_k . $$


In the first form, we have included b_0 as a fitted constant, and there are p fit parameters: b_0 ... b_{p−1}.
This is quite common, in practice, but not always necessary. The fit functions could just as well be
labeled with k = 1, ... p. Note that this prediction model has no subscripts of i, because the model
applies to all f_k and y values.


4. Linear regression assumes that our measurements include noise (this is our “measurement model”):


$$ y_i = y_{true,i} + noise_i . $$


For a given set of measurements, the noise_i are fixed, but unknown. Over an ensemble of many sets
of measurements, the noise_i are random variables. The measurement uncertainty is defined as the
1-sigma deviation of the noise (which need not be gaussian):


$$ u_i = \operatorname{dev}(noise_i) . $$


Note that the measurement model assumes additive noise (as opposed to, say, multiplicative noise).
By definition, the noise cannot be modeled.


5. The residuals ε_i are the sum of measurement noise + unmodeled behavior. Thus our measurements
can be written:


$$ y_i = b_0 + b_1 f_{1i} + b_2 f_{2i} + \dots + b_{p-1} f_{p-1,i} + \varepsilon_i = \underbrace{b_0 + \left( \sum_{k=1}^{p-1} b_k f_{ki} \right)}_{\text{model } y_{mod,i}} + \varepsilon_i , \qquad i = 1, 2, \dots n . $$


6. Multiple linear regression determines the unknown regression coefficients b_0, b_1, ... b_{p−1} from n
samples of the y and each of the f_k. The b_k are chosen to optimize some figure-of-merit, which is
most often minimizing the sum-squared-residual:


$$ \underset{\{b_k\}}{\arg\min} \; \sum_{i=1}^{n} \varepsilon_i^2 . $$


Examples: For fitting to a line, in our notation, our model is:


$$ y_{mod}(x) = b_0 + b_1 x , \qquad \text{i.e., } f_1(x) = x . $$


There are p = 2 parameters: b_0 and b_1.


For a sinusoidal periodogram analysis, we typically have a set of measurements y_i at a set of times t_i.
Given a trial frequency ω, we wish to find the least-squares cosine and sine amplitudes that best fit our data.
Without a b_0 (a bad idea), we would have:


$$ p = 2: \qquad f_1 = \cos, \quad f_2 = \sin, \qquad f_{1i} = \cos(\omega t_i), \quad f_{2i} = \sin(\omega t_i), \qquad i = 1, 2, \dots n , $$


and our fit model is:


$$ y_{mod}(t) = b_1 \cos(\omega t) + b_2 \sin(\omega t) . $$


(More on omitting b_0 below.)


Fitting to a Polynomial is Multiple Linear Regression 


Fitting a polynomial to data is a simple example of multiple linear regression [W&M p357] (see also the
Numerical Analysis section for exact polynomial “fits”). We are predicting y_i from powers of (say) t_i. As
such, we let f_ki = (t_i)^k, and proceed with standard multiple linear regression by solving the p simultaneous
standard equations (derived later) for b_k:


$$ k = 0: \qquad b_0 n + b_1 \sum_{i=1}^{n} t_i + b_2 \sum_{i=1}^{n} t_i^2 + \dots + b_{p-1} \sum_{i=1}^{n} t_i^{p-1} = \sum_{i=1}^{n} y_i $$


$$ k = 1, \dots p-1: \qquad b_0 \sum_{i=1}^{n} t_i^{k} + b_1 \sum_{i=1}^{n} t_i^{k+1} + b_2 \sum_{i=1}^{n} t_i^{k+2} + \dots + b_{p-1} \sum_{i=1}^{n} t_i^{k+p-1} = \sum_{i=1}^{n} t_i^{k} y_i . $$


Note that the second equation works for k = 0 as well, so we could have omitted the first equation. 
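The following Python sketch builds these equations for a polynomial fit, using the single general form for k = 0 ... p−1. The helper name, the sample times, and the coefficients are made up for illustration; the data are generated from an exact quadratic so the known coefficients can be checked against the equations.

```python
def polynomial_normal_equations(t, y, p):
    """Return (A, rhs) with A[k][j] = sum_i t_i**(k+j) and rhs[k] = sum_i t_i**k * y_i."""
    A = [[sum(ti ** (k + j) for ti in t) for j in range(p)] for k in range(p)]
    rhs = [sum(ti ** k * yi for ti, yi in zip(t, y)) for k in range(p)]
    return A, rhs

t = [-1.0, 0.0, 1.0, 2.0]            # made-up sample times
b_true = [0.5, -1.0, 2.0]            # made-up coefficients b_0, b_1, b_2
y = [sum(bj * ti ** j for j, bj in enumerate(b_true)) for ti in t]

A, rhs = polynomial_normal_equations(t, y, p=3)

# For exact polynomial data, the true coefficients satisfy every equation.
for k in range(3):
    lhs = sum(A[k][j] * b_true[j] for j in range(3))
    print(k, lhs, rhs[k])
```

The k = 0 row comes out as (n, Σt_i, Σt_i²) with no special handling, which is the point of the remark about the second equation.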


Homoskedastic Case: All Measurements Have the Same Uncertainty 


There are two major categories of uncertainty in multiple regression: all measurements have the same 
uncertainty (homoskedastic), and each measurement has its own uncertainty (heteroskedastic). In science, 
the heteroskedastic case is far more common, and only slightly more complicated than the homoskedastic 
case. Nonetheless, some 800+ page statistics books completely omit the heteroskedastic case of linear 
regression. We here follow tradition, and derive the slightly simpler homoskedastic case first, to illustrate 
the concepts. We then rederive nearly all the results for the heteroskedastic case, which is more useful in 
practice. Furthermore, there is a transformation from heteroskedastic measurements into an equivalent set 
of homoskedastic measurements, which are then subject to all of the following homoskedastic results. 


We proceed along these steps: 
• The raw sum of squares identity.
• The geometric view of a least-squares fit.
• The ANOVA sum of squares identity.
• The failure of the ANOVA sum of squares identity.
• Later, we provide the equivalent formulas for data with individual uncertainties.


For a set of n pairs (x_i, y_i), the “fit” means finding the values of b_k that together minimize the sum-squared
residual (appropriate if all measurements have the same uncertainty):


$$ \text{define:} \qquad y_{mod,i} \equiv y_{mod}(x_i) = \sum_{k=1}^{p} b_k f_k(x_i) ; $$

$$ \text{minimize:} \qquad SSE \equiv \sum_{i=1}^{n} \left( y_i - y_{mod,i} \right)^2 = \sum_{i=1}^{n} \varepsilon_i^2 , \qquad \varepsilon_i \equiv y_i - y_{mod,i} . $$


Note that the fit residuals ε_i include both unmodeled behavior (aka “misfit”), as well as noise (which, by
definition, cannot be modeled). So even if your noise is gaussian, your residuals are not.


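For least-squares fitting, we show later that we must simultaneously solve the following p linear
equations in p unknowns for the b_k [W&M p355]:


$$ b_0 n + b_1 \sum_{i=1}^{n} f_{1i} + b_2 \sum_{i=1}^{n} f_{2i} + \dots + b_{p-1} \sum_{i=1}^{n} f_{p-1,i} = \sum_{i=1}^{n} y_i . $$


And for each k = 1, 2, ... p−1:


$$ b_0 \sum_{i=1}^{n} f_{ki} + b_1 \sum_{i=1}^{n} f_{ki} f_{1i} + b_2 \sum_{i=1}^{n} f_{ki} f_{2i} + \dots + b_{p-1} \sum_{i=1}^{n} f_{ki} f_{p-1,i} = \sum_{i=1}^{n} f_{ki} y_i . $$


Again, all the y_i and f_ki are given. Therefore, all the sums above are constants, on both the left and right sides.
In matrix form, we solve for column vector b = (b_0, b_1, ... b_{p−1})^T from:


$$
\mathbf{X}\mathbf{b} = \mathbf{y}
\qquad \text{or} \qquad
\begin{pmatrix}
n & \sum f_{1i} & \cdots & \sum f_{p-1,i} \\
\sum f_{1i} & \sum (f_{1i})^2 & \cdots & \sum f_{1i} f_{p-1,i} \\
\vdots & \vdots & \ddots & \vdots \\
\sum f_{p-1,i} & \sum f_{p-1,i} f_{1i} & \cdots & \sum (f_{p-1,i})^2
\end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_{p-1} \end{pmatrix}
=
\begin{pmatrix} \sum y_i \\ \sum f_{1i} y_i \\ \vdots \\ \sum f_{p-1,i} y_i \end{pmatrix} .
$$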
Algorithms for solving simultaneous linear equations are well-known, so we do not address them here. 
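For readers who want to see the whole procedure once, here is a compact Python sketch: it builds the matrix of sums and the right-hand side from a list of fit functions, then solves with textbook Gaussian elimination. The function names and the three-point data set are illustrative assumptions, and the solver is meant only for small, well-conditioned systems.

```python
def normal_equations(fit_functions, x, y):
    """Build the p x p matrix of sums and the right-hand side of sums."""
    F = [[f(xi) for xi in x] for f in fit_functions]      # F[k][i] = f_k(x_i)
    A = [[sum(u * v for u, v in zip(Fk, Fj)) for Fj in F] for Fk in F]
    rhs = [sum(u * yi for u, yi in zip(Fk, y)) for Fk in F]
    return A, rhs

def solve(A, rhs):
    """Gaussian elimination with partial pivoting, then back-substitution."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]          # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    b = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * b[c] for c in range(r + 1, n))
        b[r] = (M[r][n] - s) / M[r][r]
    return b

# Made-up data; fit functions f_0(x) = 1 (for b_0) and f_1(x) = x.
x = [-1.0, 0.0, 1.0]
y = [-1.0, 0.3, 1.0]
A, rhs = normal_equations([lambda u: 1.0, lambda u: u], x, y)
b = solve(A, rhs)
print(A, rhs, b)
```

Treating the constant as a fit function f_0(x) = 1 means the first equation needs no special case: the entry n is just Σ f_0i².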




The Raw Sum-of-Squares Identity 


The sum of squares (SSQ) identity is a crucial tool of linear fitting (aka linear regression). It underlies 
many of the basic statistics of multiple linear regression and Analysis of Variance (or ANOVA). For straight- 
line fitting, the sum of squares identity can be used to define the “coefficient of determination” (and the 
associated “correlation coefficient’). For all linear fitting, the SSQ identity provides the basis for the F-test 
and t-test of fit parameter significance. The SSQ identity leads to ANOVA, which is a standardized way of 
organizing multiple regression results. 


There are two forms of SSQ identity: raw and ANOVA. Most references do not consider the raw sum 
of squares (SSQ) identity (aka the “uncorrected SSQ identity”, or SSQ identity “uncorrected for the mean” 
[W&M8]). We present it first because it provides a basis for the more-common ANOVA SSQ identity, and 
it is sometimes useful in its own right. Consider a set of data (x_i, y_i), i = 1, ... n. Conceptually, the SSQ
identity says the sum of the squares of the y_i can be partitioned into a sum of squares of model values plus a
sum of squares of residuals (often called “errors”):


$$ \text{(raw)} \quad SST = SSA + SSE: \qquad \sum_{i=1}^{n} y_i^2 = \sum_{i=1}^{n} y_{mod,i}^2 + \sum_{i=1}^{n} \big( \underbrace{y_i - y_{mod,i}}_{\text{residuals} \,=\, \varepsilon_i} \big)^2 = \sum_{i=1}^{n} y_{mod,i}^2 + \sum_{i=1}^{n} \varepsilon_i^2 . \qquad (8.2) $$

(The term “errors” can be misleading, so in words we always use “residuals.” However, we write the term 
as SSE, because that is so common in the literature.) The SSQ identity is only true for a least-squares linear 
fit to a parametrized model, and has some important non-obvious properties. We start with some examples 
of the identity, and provide simple proofs later. 


Figure 8.2 (a) Two data points, n = 2, and best-fit 1-parameter model. (b) Three data points, n = 3,
and best-fit 1-parameter model. (c) Three data points, n = 3, and best-fit 2-parameter model.


Example: n = 2, p = 1: Consider a data set of two measurements (0, 1), and (1, 2) (Figure 8.2a). We
choose a 1-parameter model:


$$ y_{mod}(x) = b_1 x . $$


The best fit line is b_1 = 2, and therefore y(x) = 2x. (We see this because the model is forced through the
origin, so the residual at x = 0 is fixed. Then the least squares residuals are those that minimize the error at
x = 1, which we can make zero.) Our raw sum-of-squares identity (8.2) is:


$$ \underbrace{1^2 + 2^2}_{SST} = \underbrace{\left( 0^2 + 2^2 \right)}_{SSA} + \underbrace{\left( 1^2 + 0^2 \right)}_{SSE} \quad\Rightarrow\quad 5 = 4 + 1 . $$


Example: n = 3, p = 1: Consider a data set of three measurements (−1, −1), (0, 0.3), and (1, 1) (Figure
8.2b). We choose a 1-parameter model:


$$ y_{mod}(x) = b_1 x . $$


The best fit line is b_1 = 1, and therefore y(x) = x. (We see this because the model is forced through the origin,
so the residual at x = 0 is fixed. Then the least squares residuals are those that minimize the errors at x = −1
and x = 1, which we can make zero.) Our raw sum-of-squares identity (8.2) is:


$$ \underbrace{(-1)^2 + 0.3^2 + 1^2}_{SST} = \underbrace{\left( (-1)^2 + 0^2 + 1^2 \right)}_{SSA} + \underbrace{\left( 0^2 + 0.3^2 + 0^2 \right)}_{SSE} \quad\Rightarrow\quad 2.09 = 2 + 0.09 . $$


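Example: n = 3, p = 2: We consider the same data: (−1, −1), (0, 0.3), and (1, 1), but we now include a
b_0 DC-offset parameter in the model:


$$ y_{mod}(x) = b_0 + b_1 x . $$


The best fit line is b_0 = 0.1, b_1 = 1, and therefore y(x) = 0.1 + x, shown in Figure 8.2c. (We see this because
the fit functions are orthogonal over the given {x_i}, and therefore the fit parameters {b_k} can be found by
correlating the data with the fit functions, normalized over the {x_i}. Trust me on this.) The model values are
then (−0.9, 0.1, 1.1), and the residuals are (−0.1, 0.2, −0.1), so the raw sum-of-squares identity (8.2) is:


$$ \underbrace{(-1)^2 + 0.3^2 + 1^2}_{SST} = \underbrace{\left( (-0.9)^2 + 0.1^2 + 1.1^2 \right)}_{SSA} + \underbrace{\left( (-0.1)^2 + 0.2^2 + (-0.1)^2 \right)}_{SSE} \quad\Rightarrow\quad 2.09 = 2.03 + 0.06 . $$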
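The three examples can be checked numerically. This Python sketch (the helper name is ours, chosen for illustration) takes each data set together with the best-fit model values implied by the fits quoted above, and compares SST against SSA + SSE.

```python
def raw_ssq(y, y_mod):
    """Return (SST, SSA, SSE) for the raw (uncorrected) sum-of-squares identity."""
    sst = sum(yi ** 2 for yi in y)
    ssa = sum(m ** 2 for m in y_mod)
    sse = sum((yi - m) ** 2 for yi, m in zip(y, y_mod))
    return sst, ssa, sse

examples = {
    "n=2, p=1": ([1.0, 2.0], [0.0, 2.0]),               # y_mod = 2x at x = 0, 1
    "n=3, p=1": ([-1.0, 0.3, 1.0], [-1.0, 0.0, 1.0]),   # y_mod = x at x = -1, 0, 1
    "n=3, p=2": ([-1.0, 0.3, 1.0], [-0.9, 0.1, 1.1]),   # y_mod = 0.1 + x
}
for name, (y, y_mod) in examples.items():
    sst, ssa, sse = raw_ssq(y, y_mod)
    print(name, sst, ssa, sse, abs(sst - (ssa + sse)) < 1e-12)
```

In each case SST equals SSA + SSE up to rounding. Adding the b_0 parameter moves some of the sum of squares from SSE into SSA, while SST is unchanged.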


In general, the SSQ identities do not hold for nonlinear fits, as is evident from the following sections. This 
means that none of the linear regression statistics are strictly valid for a nonlinear fit. Nonetheless, they are 
sometimes approximately valid, and used quite often as such. 


The Geometric View of a Least-Squares Fit 


The geometric view of a least-squares fit requires defining a vector space: measurement space (aka
“observation space”). This is an n-dimensional space, where n = the number of measurements in the data
set. Our sets of measurements {y_i}, predictors {f_ki}, residuals {ε_i}, etc. can be viewed as vectors:


$$ \mathbf{y} \equiv (y_1, y_2, \dots y_n), \qquad \boldsymbol{\varepsilon} \equiv (\varepsilon_1, \varepsilon_2, \dots \varepsilon_n), \qquad \text{etc.} $$


Thus, the entire set of measurements y is a single point in measurement space (Figure 8.3). We write that
point as the displacement vector y. If we have 1000 measurements, then measurement space is 1000-dimensional.
Measurement space is the space of all possible data sets {y_i}, with the {x_i} fixed.


Figure 8.3 (a) Measurement space, n = 2, and best-fit 1-parameter model. (b) Measurement space,
n = 3, and the 2-parameter model surface within it. (c) The shortest ε is perpendicular to every f_k.


Given a set of parameters {b_k} and the sample points {x_i}, the model (with no residuals) defines a set of
measurements, y_mod,i, which can also be plotted as a single point in measurement space. For example, Figure
8.3a shows our n = 2, p = 1 model y = b_1 x, taken at the two abscissa values x_1 = 0 and x_2 = 1, which gives
y_mod,1 = 0, y_mod,2 = b_1. The least squares fit is b_1 = 2. Then the coordinates (y_mod,1, y_mod,2) = (0, 2) give the
model vector y_mod in Figure 8.3a. Note that the model vector y_mod is orthogonal to the residual vector ε.


Note that by varying the b_k, the model points in measurement space define a p-dimensional subspace of
it. In Figure 8.3a, different values of b_1 trace out a vertical line through the origin. In this case, p = 1, so the
subspace is 1D: a line.


The n = 3 case is shown in Figure 8.3b. Here, p = 2, so the model subspace is 2D: a plane in the 3D
measurement space. Different values of b_0 and b_1 define different model points in measurement space. For
a linear fit, the origin is always on the model surface: when all the b_k = 0, all the model y_i = 0. Therefore, the
plane goes through the origin. Two more points define the plane:


$$ b_0 = 1, \; b_1 = 0 \quad\Rightarrow\quad \mathbf{y} = (1, 1, 1) $$

$$ b_0 = 0, \; b_1 = 1 \quad\Rightarrow\quad \mathbf{y} = (-1, 0, 1) $$


As shown, the model plane passes through these points. Again using linearity, note that any model vector
(point) lies on a ray from the origin, and the entire ray is within the model surface. In other words, you can
scale any model vector by any value to get another model vector. To further visualize the plane, note that
whenever b_1 = −b_0, y_3 = 0. Then y_1 = −b_1 + b_0 = 2b_0, and y_2 = b_0; therefore, the line y_2 = 0.5 y_1 lies in the
model surface, and is shown with a dashed line in Figure 8.3b.


The green dot in Figure 8.3b is the measurement vector y (in front of the model plane). The best-fit
model point is (−0.9, 0.1, 1.1). The residual vector ε goes from the model to y, and is perpendicular to the
model plane.


The model surface is entirely determined by the model (the f_k(x)), and the sample points {x_i}.


The measured values {y_i} will then determine the best-fit model, which is a point on the model surface.


In Figure 8.3a and b, we see that the residual vector is perpendicular to the best-fit linear model vector.
Is this always the case? Yes. If the model vector were shorter (Figure 8.3c), ε would have to reach farther
to go from there to the measurement vector y. Similarly, if the model vector were longer, ε would also be
longer. Therefore the shortest residual vector (least sum squared residual) must be perpendicular to the
best-fit model vector. This is true in any number of dimensions. From this geometry, we can use the
n-dimensional Pythagorean Theorem to prove the sum of squares identity immediately (in vector notation):


$$ \boldsymbol{\varepsilon} \cdot \mathbf{y}_{mod} = 0 \quad\Rightarrow\quad \underbrace{\mathbf{y}^2}_{SST} = \underbrace{\mathbf{y}_{mod}^2}_{SSA} + \underbrace{\boldsymbol{\varepsilon}^2}_{SSE} \qquad \text{where } \mathbf{y}^2 \equiv \mathbf{y} \cdot \mathbf{y}, \text{ etc.} $$


Fit parameters as coordinates of the model surface: We’ve seen that each point in the model subspace
corresponds to a unique set of {b_k}. Therefore, the b_k compose a new coordinate system for the model
subspace, different from the y_i coordinates. For example, in Figure 8.3b, the b_0 axis is defined by setting b_1
= 0. This is the line through the origin and the model point y = (1, 1, 1). The b_1 axis is defined by setting b_0
= 0. This is the line through the origin and y = (−1, 0, 1). In general, the b_k axes need not be perpendicular,
though in this example, they are.


In Figure 8.3b, ε is perpendicular to every vector in the model plane. In general, ε is perpendicular to
every f_k vector (i.e. each of the p components of the best-fit model vector in the b_k coordinates):


$$ \boldsymbol{\varepsilon} \cdot \mathbf{f}_k = 0 \quad \forall \, k = 1, \dots p , \qquad \text{where } \mathbf{f}_k \equiv b_k \left( f_k(x_1), f_k(x_2), \dots f_k(x_n) \right) . $$


Again, this must be so to minimize the length of ε, because if ε had any component parallel to any f_m, then
we could make that f_m longer or shorter, as needed, to shrink ε (Figure 8.3c). We’ll use this orthogonality in
the section on the algebra of the sum of squares.
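The orthogonality is easy to confirm for the n = 3, p = 2 example. This is a small Python check, with variable names chosen here for illustration, using the best-fit model point (−0.9, 0.1, 1.1) quoted above.

```python
def dot(u, v):
    """Ordinary dot product in measurement space."""
    return sum(a * b for a, b in zip(u, v))

y = [-1.0, 0.3, 1.0]          # measurement vector
f0 = [1.0, 1.0, 1.0]          # model point for b_0 = 1, b_1 = 0
f1 = [-1.0, 0.0, 1.0]         # model point for b_0 = 0, b_1 = 1
y_mod = [-0.9, 0.1, 1.1]      # best fit: 0.1*f0 + 1.0*f1

eps = [yi - mi for yi, mi in zip(y, y_mod)]   # residual vector

# All three dot products vanish, up to rounding error.
print(dot(eps, f0), dot(eps, f1), dot(eps, y_mod))
```

Since the residual is perpendicular to both vectors that span the model plane, it is perpendicular to every vector in that plane, including the best-fit model vector itself.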


Algebra and Geometry of the Sum-of-Squares Identity 


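We now prove the sum of squares (SSQ) identity algebraically, and highlight its corresponding
geometric features. We start by simply using the definition y_i = y_mod,i + ε_i:


$$ \sum_{i=1}^{n} y_i^2 = \sum_{i=1}^{n} \left( y_{mod,i} + \varepsilon_i \right)^2 = \sum_{i=1}^{n} y_{mod,i}^2 + \sum_{i=1}^{n} \varepsilon_i^2 + \sum_{i=1}^{n} 2 \varepsilon_i \, y_{mod,i} . \qquad (8.3) $$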


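The last term is 2ε·y_mod, which we’ve seen geometrically is zero. We now easily show it algebraically: since
SSE is minimized w.r.t. all the model parameters b_k, its derivative w.r.t. each of them is zero. I.e., for each
k:


$$ \frac{\partial\, SSE}{\partial b_k} = \frac{\partial}{\partial b_k} \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} 2 \varepsilon_i \frac{\partial}{\partial b_k} \left[ y_i - \sum_{m=1}^{p} b_m f_m(x_i) \right] = 0 . $$


In this equation, all the y_i are constant. The only term that survives the partial derivative is where m = k.
Dividing by −2, we get:


$$ \sum_{i=1}^{n} \varepsilon_i \frac{\partial}{\partial b_k} \left[ b_k f_k(x_i) \right] = \sum_{i=1}^{n} \varepsilon_i f_k(x_i) = 0 \quad\Rightarrow\quad \boldsymbol{\varepsilon} \cdot \mathbf{f}_k = 0 , \qquad k = 1, \dots p . \qquad (8.4) $$


Since y_mod is a linear combination of the fit-function vectors, (8.4) implies ε·y_mod = 0. Therefore, the last
term in (8.3) drops out, leaving the SSQ identity.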


The ANOVA Sum-of-Squares Identity 


It is often the case that the DC offset in a set of measurements is either unmeasurable, or not relevant.
This leads to ANalysis Of Variance (ANOVA), or analysis of how the data varies from its own average. In
the ANOVA case, the sum-of-squares identity is modified: we subtract the data average ȳ from both the y_i
and the y_mod,i:


$$ \text{(ANOVA)} \quad SST = SSA + SSE: \qquad \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2 = \sum_{i=1}^{n} \left( y_{mod,i} - \bar{y} \right)^2 + \sum_{i=1}^{n} \varepsilon_i^2 . \qquad (8.5) $$


This has an important consequence which is often overlooked, and proved below: the ANOVA sum-of-squares
identity holds only if the sum of residuals (not squared) = 0. This is most often achieved by including
a constant offset fit parameter, which we call b_0. (See Phase Dispersion Minimization for another way to
achieve this.)


Example: n = 3, p = 2: We again consider the data of Figure 8.2c: (−1, −1), (0, 0.3), and (1, 1). We
now use the ANOVA sum-of-squares, which is allowed because we have a b_0 (constant offset) in the model:


$$ y(x) = b_0 + b_1 x . $$


Our ANOVA sum-of-squares identity (8.5) is, using ȳ = 0.1:


$$ \underbrace{(-1.1)^2 + 0.2^2 + 0.9^2}_{SST} = \underbrace{\left( (-1)^2 + 0^2 + 1^2 \right)}_{SSA} + \underbrace{\left( (-0.1)^2 + 0.2^2 + (-0.1)^2 \right)}_{SSE} \quad\Rightarrow\quad 2.06 = 2 + 0.06 . $$


The ANOVA sum-of-squares identity holds 
for any linear least-squares fit that includes a DC offset fit parameter 


(and also in the special case that the sum of residuals (not squared) = 0). 


With no DC offset parameter in the model, in general, 
the ANOVA sum-of-squares identity fails. 


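We prove the ANOVA SSQ identity (often called just “the sum of squares identity”) similarly to our
proof of the raw SSQ identity. We start from the definition of ε_i:


$$ \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2 = \sum_{i=1}^{n} \left( \left( y_{mod,i} - \bar{y} \right) + \varepsilon_i \right)^2 $$

$$ \qquad = \sum_{i=1}^{n} \left( y_{mod,i} - \bar{y} \right)^2 + \sum_{i=1}^{n} \varepsilon_i^2 + \sum_{i=1}^{n} 2 \varepsilon_i \left( y_{mod,i} - \bar{y} \right) $$

$$ \qquad = \sum_{i=1}^{n} \left( y_{mod,i} - \bar{y} \right)^2 + \sum_{i=1}^{n} \varepsilon_i^2 + 2 \sum_{i=1}^{n} \varepsilon_i \, y_{mod,i} - 2 \bar{y} \sum_{i=1}^{n} \varepsilon_i . $$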


Compared to the raw SSQ proof, there is an extra 4th term. The 3rd term is zero, as before, because ε is
shortest when it is orthogonal to the model. The 4th term is zero when the sum of the residuals is zero. This
might happen by chance (but don’t count on it). However, it is guaranteed if we include a DC offset parameter
b_0 in the model. Recall that the constant b_0 is equivalent to a fit function f_0(x) = 1. We know from the raw
SSQ proof that for every k:


$$ \sum_{i=1}^{n} \varepsilon_i f_k(x_i) = 0 \quad\Rightarrow\quad \sum_{i=1}^{n} \varepsilon_i f_0(x_i) = \sum_{i=1}^{n} \varepsilon_i = 0 . $$


QED.


The necessary and sufficient condition for the ANOVA SSQ identity to hold is that the sum of the
residuals is zero. A sufficient condition (and the most common) is that the fit model contains a constant (DC
offset) fit parameter b_0.


The Failure of the ANOVA Sum-of-Squares Identity 


The ANOVA sum-of-squares identity fails when the sum of the residuals is not zero:


$$ \sum_{i=1}^{n} \varepsilon_i \ne 0 \quad\Rightarrow\quad \text{(ANOVA)} \quad SST \ne SSA + SSE . $$


(We proved this when we proved the ANOVA SSQ identity.) This usually mandates including a b_0
parameter, which guarantees the sum of the residuals is zero. You might think this is no problem, because
everyone probably already has a b_0 parameter; however, the traditional (now deprecated) Lomb-Scargle
algorithm [Sca 1982] fails to include a b_0 parameter, and the cos and sin components generally have nonzero
DC; therefore all of its statistics are incorrect. The error is worse for small sample sizes, and smaller for large
ones.


As an example of the failure of the sum-of-squares identity, consider again the data of Figure 8.2a: n =
2 measurements, (0, 1), and (1, 2). As before, we fit the raw data to y = b_1 x, and the best-fit is still b_1 = 2.
We now incorrectly try the ANOVA sum-of-squares identity, with ȳ = 1.5, and find it fails:


$$ \underbrace{\left( -\tfrac{1}{2} \right)^2 + \left( \tfrac{1}{2} \right)^2}_{SST} \stackrel{?}{=} \underbrace{\left( (-1.5)^2 + 0.5^2 \right)}_{SSA} + \underbrace{\left( 1^2 + 0^2 \right)}_{SSE} \quad\Rightarrow\quad \tfrac{1}{2} \ne 2.5 + 1 . $$


For another example, consider again the n = 3 data from earlier: (−1, −1), (0, 0.3), and (1, 1). If we fit
with just y = b_1 x, we saw already that b_1 = 1 (Figure 8.2b). As expected, because there is no constant fit
parameter b_0, the sum of the residuals is not zero:


$$ \sum_{i=1}^{n} \varepsilon_i = \sum_{i=1}^{n} \left( y_i - y_{mod,i} \right) = 0 + 0.3 + 0 \ne 0 . $$


Therefore, the ANOVA sum-of-squares identity fails:


$$ \underbrace{(-1.1)^2 + 0.2^2 + 0.9^2}_{SST} \stackrel{?}{=} \underbrace{\left( (-1.1)^2 + (-0.1)^2 + 0.9^2 \right)}_{SSA} + \underbrace{\left( 0^2 + 0.3^2 + 0^2 \right)}_{SSE} \quad\Rightarrow\quad 2.06 \ne 2.03 + 0.09 . $$
In the above two examples, the fit function had no DC component, so you might wonder if including a
fit function with a DC component would restore the ANOVA SSQ identity. It doesn’t, because the condition
for the ANOVA SSQ identity to hold is that the sum of residuals is zero. To illustrate, we add a fit function,
(x² + 1), with a nonzero DC (average) value, so our model is this:


$$ y_{mod}(x) = b_1 x + b_2 \left( x^2 + 1 \right) . $$


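The best fit is b_1 = 1 (as before), and b_2 = 0.0333 (from correlation). Then y_mod,i = (−0.933, 0.0333, 1.0667),
and:


$$ \underbrace{(-1.1)^2 + 0.2^2 + 0.9^2}_{SST} \stackrel{?}{=} \underbrace{\left( (-1.033)^2 + (-0.0667)^2 + 0.967^2 \right)}_{SSA} + \underbrace{\left( (-0.0667)^2 + 0.267^2 + (-0.0667)^2 \right)}_{SSE} $$

$$ \Rightarrow\quad 2.06 \ne 2.007 + 0.08 . $$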
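The size of the failure can be predicted from the proof above: for a least-squares fit, the 3rd term vanishes, so SST − (SSA + SSE) should equal the 4th term, −2ȳ Σε_i. The Python sketch below (helper name chosen for illustration) checks this for one fit that includes a DC offset and for the two straight-line fits through the origin.

```python
def anova_ssq(y, y_mod):
    """Return (SST, SSA, SSE, sum of residuals), with the data average subtracted."""
    ybar = sum(y) / len(y)
    sst = sum((yi - ybar) ** 2 for yi in y)
    ssa = sum((m - ybar) ** 2 for m in y_mod)
    eps = [yi - m for yi, m in zip(y, y_mod)]
    sse = sum(e ** 2 for e in eps)
    return sst, ssa, sse, sum(eps)

y3 = [-1.0, 0.3, 1.0]
cases = {
    "b0 + b1*x, n=3 (has DC offset)": (y3, [-0.9, 0.1, 1.1]),
    "b1*x, n=2 (no DC offset)":       ([1.0, 2.0], [0.0, 2.0]),
    "b1*x, n=3 (no DC offset)":       (y3, [-1.0, 0.0, 1.0]),
}
for name, (y, y_mod) in cases.items():
    sst, ssa, sse, sum_eps = anova_ssq(y, y_mod)
    ybar = sum(y) / len(y)
    mismatch = sst - (ssa + sse)          # zero only if the identity holds
    fourth_term = -2.0 * ybar * sum_eps   # predicted mismatch
    print(name, mismatch, fourth_term)
```

With the DC offset in the model, both numbers are zero to rounding error; without it, the mismatch is nonzero and matches the 4th term.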


Subtracting DC Before Analysis: Just Say No 


A common method of trying to avoid problems of DC offset is to simply subtract the average of the data
before fitting to it. This generally fails to solve the DC problem (though it is sometimes advisable for other
reasons, such as improved numerical accuracy in calculations). Subtracting DC makes ȳ = 0, so the ANOVA
SSQ identity is the same as the raw SSQ identity, and the raw identity always holds. However, subtracting
DC does not give an optimal fit when the fit functions have a DC offset over the {x_i}. The traditional Lomb-
Scargle analysis [Sca 1982] has this error. The only solution is to use a 3-parameter fit: a constant, a cosine
component, and a sine component [Zechm 2009].


wy y=bot+ dit 
y=bit 
oe a : - ia 
(a) | (b) | 


Figure 8.4 (a) The top curve (blue) shows a cosine whose amplitude is fit to data points. The bottom
curve (red) shows the same frequency fit to DC-subtracted data, and is a much worse fit. (b) You
would never fit a straight line without including DC. Why would you do it with a sinusoid?


Figure 8.4a shows an example of the failure of DC-subtraction to fix the problem, and how DC-subtraction
can lead to a much worse fit. Therefore, for sinusoidal analysis (which excludes PDM):


We must include the constant $b_0$ parameter both to enable the other parameters to be properly fit,
and to enable Analysis of Variance with the ANOVA SSQ identity.


In general, any fit parameter that we must include in the model, but whose value we actually don’t need, is
called a nuisance parameter. $b_0$ is probably the most common nuisance parameter in data analysis.
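To make the failure of DC-subtraction concrete, here is a minimal sketch under stated assumptions: the sample times, frequency, and true parameters are hypothetical, the data are noise-free, and for brevity the model has only a constant and a cosine (the full recommendation is a constant, a cosine, and a sine). The samples cover less than one cycle, so the cosine has a nonzero average over them.

```python
import math

# Hypothetical sample times covering less than a full cycle (illustrative assumption).
t = [0.0, 0.1, 0.2, 0.3, 0.4]
omega = 2 * math.pi           # one cycle per unit time
b0_true, a_true = 5.0, 2.0    # noise-free data: y = b0 + a cos(omega t)

c = [math.cos(omega * ti) for ti in t]
y = [b0_true + a_true * ci for ci in c]
n = len(t)

# (1) Subtract DC from the data, then fit the cosine amplitude alone.
y_bar = sum(y) / n
a_dc = sum(ci * (yi - y_bar) for ci, yi in zip(c, y)) / sum(ci * ci for ci in c)
sse_dc = sum((yi - y_bar - a_dc * ci) ** 2 for ci, yi in zip(c, y))

# (2) Fit the constant and the cosine together: solve the 2x2 normal equations.
s_c, s_cc = sum(c), sum(ci * ci for ci in c)
s_y, s_cy = sum(y), sum(ci * yi for ci, yi in zip(c, y))
det = n * s_cc - s_c * s_c
b0_fit = (s_y * s_cc - s_c * s_cy) / det
a_fit = (n * s_cy - s_c * s_y) / det
sse_full = sum((yi - b0_fit - a_fit * ci) ** 2 for ci, yi in zip(c, y))

print(f"DC-subtracted: amplitude = {a_dc:.4f}, SSE = {sse_dc:.4f}")
print(f"with constant: b0 = {b0_fit:.4f}, amplitude = {a_fit:.4f}, SSE = {sse_full:.2e}")
```

With the constant included, the fit recovers the true parameters; after DC-subtraction the amplitude is biased and the residuals are nonzero even though the data contain no noise.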


Fitting to Orthonormal Functions 


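For p orthonormal fit functions, each $b_k$ can be found by a simple inner product:


$$ b_k = \mathbf{f}_k \cdot \mathbf{y} = \sum_{i=1}^{n} f_k(x_i)\, y_i, \qquad k = 1, \dots, p \qquad (\text{orthonormal}) . $$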


As examples, this is how Fourier Transform coefficients are found (for uniformly spaced samples), and 
usually how we find components of a ket in quantum mechanics. 
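The inner-product rule can be sketched numerically as follows. The three fit functions (a constant and one cycle of cosine and sine over n = 8 uniformly spaced samples) and the coefficient values are illustrative choices of ours, not taken from the text.

```python
import math

n = 8
idx = range(n)
# Three fit functions that are orthonormal over n uniformly spaced samples
# (illustrative choice): a constant, and one cycle of cosine and of sine.
f = [
    [1 / math.sqrt(n)] * n,
    [math.sqrt(2 / n) * math.cos(2 * math.pi * i / n) for i in idx],
    [math.sqrt(2 / n) * math.sin(2 * math.pi * i / n) for i in idx],
]

def dot(a, b):
    """Inner product over the sample points."""
    return sum(ai * bi for ai, bi in zip(a, b))

# Build data from known coefficients, then recover each one by a single inner product.
b_true = [3.0, -1.5, 0.25]
y = [sum(bk * fk[i] for bk, fk in zip(b_true, f)) for i in idx]
b_fit = [dot(fk, y) for fk in f]
print(b_fit)
```

No matrix inversion is needed: orthonormality decouples the coefficients, so each one is found independently of the others.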


Hypothesis Testing with the Sum of Squares Identity 


A big question for some data analysts is, “Is there a signal in my data?” For example, “Is the star’s
intensity varying periodically?” One approach to answering this question is to fit for the signal you expect,
and then test the probability that the fit is just noise. This is a simple form of Analysis of Variance
(ANOVA). This type of hypothesis test is widely used throughout science, e.g. astronomers use this significance
test in Phase Dispersion Minimization periodograms, and (now deprecated) Lomb-Scargle.


To make progress in determining if a signal is present, we will test the hypothesis:
$H_0$: there is no signal, i.e. our data is pure noise.


This is called the null hypothesis, because we usually define it to be a hypothesis that nothing interesting is 
in our data, e.g. there is no signal, our drug doesn’t cure the disease, the two classes are performing equally 
well, etc. 


After our analysis, we reach one of two conclusions: either we reject $H_0$, or we fail to reject it. It is
crucial to be crystal clear in our logic here. If our analysis shows that $H_0$ is unlikely to be true, then we reject
$H_0$, and take it to be false. We also quantify our confidence level in rejecting $H_0$, typically 95% or better.
Rejecting $H_0$ means there is a signal, i.e. our data is not pure noise. Note that rejecting $H_0$, by itself, tells us
nothing about the nature of the signal that we conclude is present. In particular, it may or may not match the
model we fitted for (but it certainly must have some correlation with our model).


However, if our analysis says $H_0$ has even a fair chance of being true (typically > 5%), then we do not
reject it.


This point cannot be over-emphasized. 


Notice that scientists are a conservative lot: if we claim a detection, we want to be highly confident that 
our claim is true. It wouldn’t do to have scientists crying “wolf” all the time, and being wrong a lot. The 
rule of thumb in science is, “If you are not highly confident, then don’t make a claim.” You can, however, 
say that your results are intriguing, and justify further investigation. 


Introduction to Analysis of Variance (ANOVA) 


ANOVA addresses the question: Why don’t all my measurements equal the average? The “master
equation” of ANOVA is the sum of squares identity (see the earlier discussion of the ANOVA SSQ identity):

$$ \mathrm{SST} = \mathrm{SSA} + \mathrm{SSE} $$

where SST = total sum of squared variation
      SSA = modeled sum of squared variation
      SSE = residual sum of squares


This equation says that in our data, the total of “squared-differences” from the average is the sum of the
modeled differences plus the unmodeled residuals. Specifically, the total sum of squared differences
(SST) equals the modeled sum of squared differences (SSA) plus the residual (unmodeled + noise) sum of
squares (SSE).
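As a small illustration of the identity, the sketch below fits a straight line with a constant term to made-up data (the x and y values are hypothetical) and computes the three sums of squares directly.

```python
# Hypothetical data, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
# Least-squares straight line y_mod = b0 + b1 x (the constant b0 is included).
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_mod = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                # total
ssa = sum((mi - y_bar) ** 2 for mi in y_mod)            # modeled
sse = sum((yi - mi) ** 2 for yi, mi in zip(y, y_mod))   # residual
resid_sum = sum(yi - mi for yi, mi in zip(y, y_mod))

print(f"SST = {sst:.4f}, SSA + SSE = {ssa + sse:.4f}, sum of residuals = {resid_sum:.2e}")
```

Because the model includes a constant, the residuals sum to zero (to rounding error) and SST = SSA + SSE.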


[ANOVA is identical to least-squares linear regression (fitting) to the “categorical variables.” More later.] 


To test a hypothesis, we must consider that our data is only one set of many possible sets that might have
been taken, each with different noise contributions, $\varepsilon_i$. Recall that when considered over an ensemble of
hypothetical data sets, all the fit parameters $b_m$, as well as SST, SSA, and SSE are random variables. It is in
this sense that we speak of their statistical properties.


For concreteness, consider a time sequence of data, such as a light curve with pairs of times and
intensities, $(t_i, s_i)$. Why do the measured intensities vary from the average? There are conceptually three
reasons:


• We have an accurate model, which predicts deviations from the average.

• The system under study is more complex than our model, so there are unmodeled, but
  systematic, deviations.

• There is noise in the measurement (which by definition, cannot be modeled).


However, mathematically we can distinguish only two reasons for variation in the measurements: either we
predict the variation with a model, or we don’t, i.e. modeled effects, and unmodeled effects. Therefore, in
practice, the 2nd and 3rd bullets above are combined into residuals: unmodeled variations in the data, which
include both systematic physics and measurement noise.


This section requires a conceptual understanding of vector decomposition into both orthonormal and
non-orthonormal basis sets.


The Temperature of Liberty


As prerequisite to hypothesis testing, we must consider a number of properties of the fit coefficients $b_k$
that occur when we apply linear regression to measurements y. We then apply these results to the case when
the “null hypothesis” is true: there is no signal (only noise). We proceed along these lines:


• A look ahead to our goal.
• The distribution of orthonormal fit coefficients, $b_k$.
• The non-correlation of orthonormal fit coefficients in pure noise.
• The model sum-of-squares (SSA).
• The residual sum-of-squares (SSE) in pure noise.


[Temperature of liberty? Get it?] 


A Look Ahead to the Result Needed for Hypothesis Testing 


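To better convey where we are headed, the following sections will prove the degrees-of-freedom
decomposition of the sum-of-squares (SSQ) identity:

$$ \text{(raw)} \qquad \mathrm{SST} = \mathrm{SSA} + \mathrm{SSE} \quad\Rightarrow\quad \underbrace{\mathbf{y}^2}_{\text{dof}\,=\,n} = \underbrace{\mathbf{y}_{mod}^2}_{\text{dof}\,=\,p} + \underbrace{\boldsymbol{\varepsilon}^2}_{\text{dof}\,=\,n-p} . $$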


We already proved the SSQ identity holds for any least-squares linear fit (regardless of the distribution of 
SSE). To perform hypothesis testing, we must further know that for pure noise, the n degrees of freedom 
(dof) of SST also separate into p dof in SSA, and n — p dof in SSE. 


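For the ANOVA SSQ identity, the subtraction of the average reduces the dof by 1, so the dof partition
as:

$$ \text{(ANOVA)} \qquad \mathrm{SST} = \mathrm{SSA} + \mathrm{SSE} \quad\Rightarrow\quad \underbrace{(\mathbf{y} - \bar{y})^2}_{\text{dof}\,=\,n-1} = \underbrace{(\mathbf{y}_{mod} - \bar{y})^2}_{\text{dof}\,=\,p-1} + \underbrace{\boldsymbol{\varepsilon}^2}_{\text{dof}\,=\,n-p} . $$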


Distribution of Orthogonal Fit Coefficients in the Presence of Pure Noise 


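We have seen that if a fit function is orthogonal to all other fit functions, then its fit coefficient is given
by a simple correlation. I.e., for a given k:


$$ b_k = \frac{\mathbf{f}_k \cdot \mathbf{y}}{\mathbf{f}_k^2} = \frac{\sum_{i=1}^{n} f_k(x_i)\, y_i}{\sum_{i=1}^{n} f_k(x_i)^2} . $$


We now further restrict ourselves to a normalized (over the $\{x_i\}$) fit-function, so that:


$$ \mathbf{f}_k \cdot \mathbf{f}_j = 0 \ \text{for all } j \ne k, \qquad \sum_{i=1}^{n} f_k(x_i)^2 = 1 \quad\Rightarrow\quad b_k = \sum_{i=1}^{n} f_k(x_i)\, y_i . \tag{8.6} $$
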
We now consider an ensemble of sample sets of pure noise, each with the same set of $\{x_i\}$, and each producing
a random $b_k$. In other words, the $b_k$ are RVs over the set of possible sample-sets. Therefore, in the presence
of pure noise, we can easily show that var($b_k$) = var(y) = $\sigma^2$. Recall that the variance of a sum (of uncorrelated
RVs) is the sum of the variances, and the variance of k times an RV = $k^2$ var(RV). All the values of $f_k(x_i)$ are
constants, and var($y_i$) = var(y) = $\sigma^2$; therefore from (8.6):


$$ \mathrm{var}(b_k) = \underbrace{\left[\sum_{i=1}^{n} f_k(x_i)^2\right]}_{1} \sigma^2 = \sigma^2 . $$


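This is a remarkable and extremely useful result: in pure noise, the fit coefficient of a normalized fit-function
that is orthogonal to all the others has the same variance as the noise itself, $\sigma^2$.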


At this point, the noise need not be zero-mean. In fact:

$$ \langle b_k \rangle = \sum_{i=1}^{n} f_k(x_i)\, \underbrace{\langle y_i \rangle}_{\mu_y} . $$


Since the sum has no simple interpretation, this equation is most useful for showing that if the noise is
zero-mean, then $b_k$ is also zero-mean: $\langle b_k \rangle = 0$. However, if the fit-function $f_k$ taken over the $\{x_i\}$ happens to be
zero mean, then the summation is zero, and even for non-zero mean noise, we again have $\langle b_k \rangle = 0$.


Similarly, any weighted sum of gaussian RVs is a gaussian; therefore, if the $y_i$ are gaussian (zero-mean
or not), then $b_k$ is also gaussian.


Non-correlation of Orthogonal Fit Coefficients in Pure Noise


We now consider the correlation between two fit coefficients, $b_k$ and $b_m$ (again, over multiple samples
(sample sets) of noise), when the fit-functions $f_k$ and $f_m$ are orthogonal to each other, and to all other
fit-functions. We show that the covariance cov($b_k$, $b_m$) = 0, and so the coefficients are uncorrelated. For
convenience, we take $f_k$ and $f_m$ to be normalized: $\mathbf{f}_k^2 = \mathbf{f}_m^2 = 1$. We start with the formula for a fit-coefficient
of a fit-function that is orthogonal to all others, (8.6), and use our algebra of statistics:


$$ \mathrm{cov}(b_k, b_m) = \mathrm{cov}\left(\mathbf{f}_k \cdot \mathbf{y},\ \mathbf{f}_m \cdot \mathbf{y}\right) = \mathrm{cov}\left(\sum_{i=1}^{n} f_k(x_i)\, y_i,\ \sum_{j=1}^{n} f_m(x_j)\, y_j\right) . $$

Again, all the $f_k$ and $f_m$ are constants, so they can be pulled out of the cov( ) operator:


$$ \mathrm{cov}(b_k, b_m) = \sum_{i=1}^{n} \sum_{j=1}^{n} f_k(x_i)\, f_m(x_j)\, \mathrm{cov}(y_i, y_j) . $$


As always, the $y_i$ are independent, and therefore uncorrelated. Hence, when $i \ne j$, cov($y_i$, $y_j$) = 0, so only the
$i = j$ terms survive, and the double sum collapses to a single sum. Also, cov($y_i$, $y_i$) = var($y_i$) = $\sigma^2$, which is a
constant:


$$ \mathrm{cov}(b_k, b_m) = \sigma^2 \underbrace{\sum_{i=1}^{n} f_k(x_i)\, f_m(x_i)}_{0} = 0 \qquad (\mathbf{f}_k\ \&\ \mathbf{f}_m \text{ are orthogonal}) . $$


This is true for arbitrary distributions of $y_i$, even if the $y_i$ are nonzero-mean.
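A quick Monte Carlo sketch can illustrate both results (variance equal to the noise variance, and zero covariance) for noise that is neither gaussian nor zero-mean. The fit functions, sample count, seed, and number of trials are arbitrary choices of ours; the noise is uniform on [0, 1), which has mean 1/2 and variance 1/12.

```python
import math
import random

random.seed(1)
n, trials = 8, 20000
# Two orthonormal fit functions over n uniformly spaced samples (illustrative choice).
f_k = [math.sqrt(2 / n) * math.cos(2 * math.pi * i / n) for i in range(n)]
f_m = [math.sqrt(2 / n) * math.sin(2 * math.pi * i / n) for i in range(n)]

# Uniform noise on [0, 1): non-gaussian, mean 1/2, variance 1/12.
sigma2 = 1 / 12
bk, bm = [], []
for _ in range(trials):
    y = [random.random() for _ in range(n)]
    bk.append(sum(f * yi for f, yi in zip(f_k, y)))   # b_k = f_k . y, eq. (8.6)
    bm.append(sum(f * yi for f, yi in zip(f_m, y)))

mean_k, mean_m = sum(bk) / trials, sum(bm) / trials
var_k = sum((b - mean_k) ** 2 for b in bk) / (trials - 1)
cov_km = sum((a - mean_k) * (b - mean_m) for a, b in zip(bk, bm)) / (trials - 1)

print(f"var(b_k) = {var_k:.4f}  (sigma^2 = {sigma2:.4f})")
print(f"cov(b_k, b_m) = {cov_km:.5f},  <b_k> = {mean_k:.4f}")
```

Here the average of $b_k$ also comes out near zero, even though the noise is not zero-mean, because this particular fit function sums to zero over the sample points.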


The Total Sum-of-Squares (SST) in Pure Noise 


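The total sum of squares is:


$$ \text{raw:} \qquad \mathrm{SST} \equiv \mathbf{y} \cdot \mathbf{y} = \sum_{i=1}^{n} y_i^2 $$

$$ \text{ANOVA:} \qquad \mathrm{SST} \equiv (\mathbf{y} - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad \text{where } \bar{y} \equiv \frac{1}{n} \sum_{i=1}^{n} y_i . $$


For zero-mean gaussian noise, the raw SST (taken over an ensemble of samples) satisfies the definition of a
scaled $\chi^2$ RV with n degrees of freedom (dof), i.e. $\mathrm{SST}/\sigma^2 \in \chi^2_n$. As is well-known, the ANOVA SST, by
subtracting off the sample average, reduces the dof by 1, so ANOVA $\mathrm{SST}/\sigma^2 \in \chi^2_{n-1}$.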


The Model Sum-of-Squares (SSA) in Pure Noise 


We’re now ready for the last big step: to show that in pure noise, the model sum-of-squares (SSA) has p
degrees of freedom (p = # fit parameters). The model can be thought of as a vector, $\mathbf{y}_{mod} = \{y_{mod,i}\}$, and the
basis functions for that vector are the fit-functions evaluated at the sample points, $\mathbf{f}_m = \{f_m(x_i)\}$. Then:

$$ \mathbf{y}_{mod} = \sum_{m=1}^{p} b_m \mathbf{f}_m . $$


The $\mathbf{f}_m$ may be oblique (non-orthogonal), and of arbitrary normalization. However, for any model vector
space spanned by $\mathbf{y}_{mod}$, there exists an orthonormal basis in which it may be written:

$$ \mathbf{y}_{mod} = \sum_{m=1}^{p} c_m \mathbf{g}_m \qquad \text{where } \mathbf{g}_m \equiv \text{orthonormal basis},\ c_m \equiv \text{coefficients in the } \mathbf{g} \text{ basis} . \tag{8.7} $$


We’ve shown that since the $\mathbf{g}_m$ are orthonormal, the $c_m$ are uncorrelated, with var($c_m$) = $\sigma^2$. Now consider
$\mathbf{y}_{mod}^2$ written as a summation:


$$ \mathbf{y}_{mod}^2 = \sum_{i=1}^{n} \left( \sum_{m=1}^{p} c_m\, g_m(x_i) \right)^2 . $$


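Since the $\mathbf{g}_m$ are orthogonal, all the cross terms in the square are zero. Then reversing the order of summation
gives:


$$ \mathbf{y}_{mod}^2 = \sum_{m=1}^{p} \sum_{i=1}^{n} \left( c_m\, g_m(x_i) \right)^2 = \sum_{m=1}^{p} c_m^2 \underbrace{\sum_{i=1}^{n} g_m(x_i)^2}_{1} = \sum_{m=1}^{p} c_m^2 . \tag{8.8} $$
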
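Therefore, $\mathbf{y}_{mod}^2$ is the sum of p uncorrelated RVs (the $c_m^2$). Using the general formula for the average of the
square of an RV (7.2):


$$ \langle c_m^2 \rangle = \langle c_m \rangle^2 + \mathrm{var}(c_m) = \langle c_m \rangle^2 + \sigma^2 \quad\Rightarrow\quad \langle \mathbf{y}_{mod}^2 \rangle = \sum_{m=1}^{p} \langle c_m \rangle^2 + p\,\sigma^2 . $$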


This is true for any distribution of noise, even non-zero-mean. In general, there is no simple formula for
var($\mathbf{y}_{mod}^2$).


If the noise is zero-mean, then each $\langle c_m \rangle = 0$, and the above reduces to:

$$ \langle \mathbf{y}_{mod}^2 \rangle = p\,\sigma^2 \qquad (\text{zero-mean noise}) . $$


If the noise is zero-mean gaussian, then the $c_m$ are zero-mean uncorrelated joint-gaussian RVs. This is
a well-known condition for independence [ref ??], so the $c_m$ are independent, gaussian, with variance $\sigma^2$.
Then (8.8) tells us that $\mathbf{y}_{mod}^2$ is a scaled chi-squared RV with p degrees of freedom:


$$ \text{(raw)} \qquad \frac{\mathbf{y}_{mod}^2}{\sigma^2} \in \chi^2_p \qquad (\text{zero-mean gaussian noise}) . $$


We developed this result using the properties of the orthonormal basis, but our model $\mathbf{y}_{mod}$, and therefore
$\mathbf{y}_{mod}^2$, are identical in any basis. Therefore, the result holds for any p fit-functions that span the same model
space, even if they are oblique (i.e. overlapping) and not normalized.
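The claim that the raw model sum-of-squares averages to $p\,\sigma^2$ even for oblique, unnormalized fit functions can be checked with a short simulation. This is a sketch under our own assumptions: the fit functions are f1 = 1 and f2 = x over x = 0..5 (oblique because the x values do not sum to zero), the noise is zero-mean gaussian, and the seed and trial count are arbitrary.

```python
import random

random.seed(3)
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
n, p, sigma, trials = len(x), 2, 2.0, 20000

# Fit functions f1 = 1 and f2 = x are oblique here, since sum(x) != 0.
s_x, s_xx = sum(x), sum(xi * xi for xi in x)
det = n * s_xx - s_x * s_x

total = 0.0
for _ in range(trials):
    y = [random.gauss(0.0, sigma) for _ in range(n)]   # pure zero-mean noise
    s_y, s_xy = sum(y), sum(xi * yi for xi, yi in zip(x, y))
    # Solve the 2x2 normal equations for y_mod = b0 + b1 x.
    b0 = (s_y * s_xx - s_x * s_xy) / det
    b1 = (n * s_xy - s_x * s_y) / det
    total += sum((b0 + b1 * xi) ** 2 for xi in x)      # raw SSA = y_mod^2

mean_raw_ssa = total / trials
print(f"<y_mod^2> = {mean_raw_ssa:.3f},  p sigma^2 = {p * sigma ** 2:.3f}")
```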


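For the ANOVA SSQ identity, a similar analysis shows that the constraint of $\bar{y}$ removes one degree of
freedom from SSA, and therefore, for zero-mean noise:

$$ \left\langle (\mathbf{y}_{mod} - \bar{y})^2 \right\rangle = (p - 1)\,\sigma^2 \qquad (\text{zero-mean noise}) . $$

For zero-mean gaussian noise, then:


$$ \frac{(\mathbf{y}_{mod} - \bar{y})^2}{\sigma^2} = \frac{\mathrm{SSA}}{\sigma^2} \in \chi^2_{p-1} \qquad (\text{ANOVA SSQ, zero-mean gaussian noise}) . $$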


If instead of pure noise, we have a signal that correlates to some extent with the model, then $(\mathbf{y}_{mod} - \bar{y})^2$
will be bigger, on average, than $(p - 1)\sigma^2$. That is, the model will explain some of the variation in the data,
and therefore the model sum-of-squares will (on average) be bigger than just the noise (even non-gaussian
noise):

$$ \left\langle (\mathbf{y}_{mod} - \bar{y})^2 \right\rangle \equiv \langle \mathrm{SSA} \rangle > (p - 1)\,\sigma^2 \qquad (\text{signal + zero-mean noise}) . $$


The Residual Sum-of-Squares (SSE) in Pure Noise


We determine the distribution of SSE in pure noise from the following:

• For least-squares linear fits: SST = SSA + SSE.
• From our analysis so far, in pure gaussian zero-mean noise:
  $\mathrm{SST}/\sigma^2 \in \chi^2_{n-1}$, $\mathrm{SSA}/\sigma^2 \in \chi^2_{p-1}$.
• From the definition of $\chi^2_\nu$, the sum of independent $\chi^2$ RVs is another $\chi^2$ RV, and the dof add.

These are sufficient to conclude that $\mathrm{SSE}/\sigma^2$ must be $\chi^2_{n-p}$, and must be independent of SSA. [I’d like to show
this separately from first principles??]:


$$ \mathrm{SSE}/\sigma^2 \in \chi^2_{n-p} \qquad (\text{for pure gaussian zero-mean noise}) . $$
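The degrees-of-freedom partition can be illustrated with a short simulation. The sketch below assumes a straight-line model with a constant (p = 2) over n = 8 arbitrary sample points, pure zero-mean gaussian noise, and a helper `line_fit_ssq` of our own naming; the ensemble averages of SST, SSA, and SSE (in units of $\sigma^2$) should come out near n − 1, p − 1, and n − p.

```python
import random

def line_fit_ssq(x, y):
    """Return ANOVA (SST, SSA, SSE) for a least-squares line with a constant term."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssa = sum((b1 * (xi - x_bar)) ** 2 for xi in x)    # (y_mod,i - y_bar)^2
    sse = sum((yi - y_bar - b1 * (xi - x_bar)) ** 2 for xi, yi in zip(x, y))
    return sst, ssa, sse

random.seed(4)
x = [float(i) for i in range(8)]
n, p, sigma, trials = len(x), 2, 3.0, 20000

sums = [0.0, 0.0, 0.0]
for _ in range(trials):
    y = [random.gauss(0.0, sigma) for _ in x]          # pure noise: H0 is true
    for k, value in enumerate(line_fit_ssq(x, y)):
        sums[k] += value / sigma ** 2

mean_sst, mean_ssa, mean_sse = (s / trials for s in sums)
print(f"<SST>/sigma^2 = {mean_sst:.2f}  (n - 1 = {n - 1})")
print(f"<SSA>/sigma^2 = {mean_ssa:.2f}  (p - 1 = {p - 1})")
print(f"<SSE>/sigma^2 = {mean_sse:.2f}  (n - p = {n - p})")
```

This checks only the averages, not the full chi-squared distributions or the independence of SSA and SSE.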


The F-test: The Decider for Zero Mean Gaussian Noise 


In the sections on linear fitting, our results are completely general, and we made no assumptions at all 
about the nature of the residuals. In the more recent results under hypothesis testing, we have made the 
minimum assumptions possible, to have the broadest applicability possible. However: 


One common assumption is that our noise is zero-mean gaussian. Then we can quantitatively test if our 
data are pure noise, and establish a level of confidence (e.g., 98%) in our conclusion. Later, we show how 
to use simulations to remove the restriction to gaussian noise, and establish confidence bounds for any 
distribution of residuals. 


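For zero-mean pure gaussian noise only: we have shown that the raw $\mathrm{SSA}/\sigma^2 \in \chi^2_p$. We have also
indicated that for ANOVA:


$$ \begin{aligned}
\mathrm{SST}/\sigma^2 &\in \chi^2_{n-1} &\quad\Rightarrow\quad \sigma^2 &= \langle \mathrm{SST}/(n-1) \rangle \\
\mathrm{SSA}/\sigma^2 &\in \chi^2_{p-1} &\quad\Rightarrow\quad \sigma^2 &= \langle \mathrm{SSA}/(p-1) \rangle \\
\mathrm{SSE}/\sigma^2 &\in \chi^2_{n-p} &\quad\Rightarrow\quad \sigma^2 &= \langle \mathrm{SSE}/(n-p) \rangle
\end{aligned} $$


Furthermore, SSA and SSE are statistically independent, and each provides an estimate of the noise variance
$\sigma^2$.

[Note that the difference between two independent $\chi^2$ RVs has no simple distribution. This means that SST is
correlated with SSA in just the right way so that (SST − SSA) = SSE is $\sigma^2\chi^2$ distributed with n − p dof; similarly SST
is correlated with SSE such that (SST − SSE) = SSA $\in \sigma^2\chi^2$ with p − 1 dof.]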


We can take the ratio of the two independent estimates of $\sigma^2$, and in pure noise, we should get something
close to 1:

$$ f \equiv \frac{\mathrm{SSA}/(p-1)}{\mathrm{SSE}/(n-p)} \approx 1 \qquad (\text{in pure noise}) . $$


Of course, this ratio is itself a (dimensionless) random variable, and will vary from sample set to sample set. 


The distribution of the RV f is the Fisher-Snedecor F-distribution [ref??]. It is the distribution of the ratio
of two independent reduced-$\chi^2$ parameters. Its closed-form is not important, but its general properties are.
First, the distribution depends on both the numerator and denominator degrees of freedom, so F is a
two-parameter family of distributions, denoted here as F(dof num, dof denom; f). (Some references use F to
denote the CDF, rather than PDF.)


If our test value f is much larger than 1, we might suspect that $H_0$ is false: we actually have a signal. We
establish this quantitatively with a one-sided F-test, at the $\alpha$ level of significance (Figure 8.5):


$$ f > \text{critical\_value}\left[F_{p-1,\,n-p};\ \alpha\right] \quad\Rightarrow\quad \text{reject } H_0 . $$


If f > critical value, then it is unlikely to be the result of pure noise. We therefore reject $H_0$ at the $\alpha$ level of
significance, or equivalently, at the $(1 - \alpha)$ level of confidence.
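The decision rule can be sketched in code. To stay within the Python standard library, the sketch below estimates the critical value by simulating f in pure gaussian noise rather than evaluating the F distribution itself; that substitution is ours, and a statistics library's inverse F CDF would normally be used instead. The straight-line model (p = 2), the data, the seed, and the helper name `f_statistic` are all hypothetical.

```python
import random

def f_statistic(x, y):
    """f = [SSA/(p-1)] / [SSE/(n-p)] for a straight-line fit with constant (p = 2)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    ssa = sum((b1 * (xi - x_bar)) ** 2 for xi in x)
    sse = sum((yi - y_bar - b1 * (xi - x_bar)) ** 2 for xi, yi in zip(x, y))
    return (ssa / 1) / (sse / (n - 2))

random.seed(5)
x = [float(i) for i in range(10)]
alpha, trials = 0.05, 20000

# Null distribution of f: pure zero-mean gaussian noise, same x every time.
null_f = sorted(f_statistic(x, [random.gauss(0.0, 1.0) for _ in x])
                for _ in range(trials))
critical = null_f[int((1 - alpha) * trials)]   # upper-tail area is about alpha

# Hypothetical data with a genuine slope plus noise.
y = [2.0 * xi + random.gauss(0.0, 1.0) for xi in x]
f = f_statistic(x, y)
print(f"f = {f:.1f}, critical value = {critical:.2f}")
print("reject H0" if f > critical else "fail to reject H0")
```

The simulated critical value should land near the tabulated value for F with (1, 8) dof at $\alpha$ = 0.05, which is about 5.32.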


[Figure 8.5: three panels, each showing the PDF for $F_{p-1,\,n-p}$ with the critical value marked; the area beyond the
critical value is $\alpha$, and a test value f beyond the critical value means reject $H_0$.]


Figure 8.5 One-sided F-test for the null hypothesis, $H_0$. (a) Critical f value; (b) a statistically
significant result; (c) not statistically significant result.


Large-n behavior of F: It is often the case that n is large, and therefore $\nu_{den} = n - p$ is large. Then the
distribution $F(\nu_{num}, \nu_{den}; f)$ is dominated by the numerator, since the denominator distribution is very tight.
Then the F distribution is approximately that of the numerator alone, which is a reduced $\chi^2$ with $\nu_{num}$ degrees of freedom.


Coefficient of Determination and Correlation Coefficient 


We hear a lot about the correlation coefficient, $\rho$, but it’s actually fairly useless. However, its square
($\rho^2$) is the coefficient of determination, and is much more meaningful: it tells us the fraction of measured
variation “explained” by a fit to the predictor f(x). This is sometimes useful as a measure of the effectiveness
of the model. $\rho^2$ is a particular use of the linear regression we have already studied. (We mention a slightly
different use for $\rho^2$ at the end.)


First consider a (possibly infinite) population of (x, y) pairs. Typically, x is an independent variable, and
y is a measured dependent variable. We often think of the fit function as $f_1(x) = x$ (which we use as our
example), but as with all linear regression, the fit-function is arbitrary. Recall the sum-of-squares definitions
of SST, SSA, and SSE (8.5). We define the coefficient of determination in linear-fit terms, as the fraction of
SST that is determined by the best-fit model. This is also the ratio of population variances of a least-squares
fit:


$$ \rho^2 \equiv \frac{\mathrm{SSA}}{\mathrm{SST}} = \frac{\mathrm{var}(y_{mod})}{\mathrm{var}(y)} \qquad (\text{population}) . $$


Note that for the variance of the model $y_{mod}$ to be defined, the domain of x must be finite, i.e. x must have
finite lower and upper bounds. For experimental data, this requirement is necessarily satisfied.


Now consider a sample of n (x, y) pairs. It is a straightforward application of our linear regression
principles to estimate $\rho^2$. We call the estimate the sample coefficient of determination, $r^2$, and define it
analogously to the population parameter:


$$ r^2 \equiv \frac{\mathrm{SSA}}{\mathrm{SST}} \qquad (\text{sample coefficient of determination}) \qquad \text{[Myers 1986 2.20 p28]} $$

$$ \text{where} \qquad \mathrm{SSA} \equiv \sum_{i=1}^{n} \left( y_{mod,i} - \bar{y} \right)^2, \qquad \mathrm{SST} \equiv \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2 . $$


Note that the number of fit parameters is p = 2 ($b_0$ and $b_1$). Therefore SSA has p − 1 = 1 degree of freedom
(dof), and SST has n − 1 dof.

[The sample correlation coefficient is just r (with a sign given below):


$$ |r| = \sqrt{r^2} = \sqrt{\mathrm{SSA}/\mathrm{SST}} \qquad (\text{sample correlation coefficient}) . $$


For multiple regression (i.e., with multiple “predictors”, where p ≥ 3 but one is the constant $b_0$), we define r always
≥ 0. In the case of single regression to one predictor (call it x, p = 2 but still one is the constant $b_0$), r > 0 if y
increases with the predictor x, and r < 0 if y decreases with increasing x.]

For simplicity, we start with a sample where $\bar{x} = \bar{y} = 0$. At the end, we easily extend the result to the
general case where either or both averages are nonzero. If $\bar{x} = 0$, then $f_1$ is orthogonal to the constant $b_0$, and
we can find $b_1$ by a simple correlation, including normalization of $f_1$ (see linear regression to orthogonal fit
functions, earlier):


$$ b_1 = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2} = \frac{n \langle xy \rangle}{n\,\sigma_x^2} = \frac{\langle xy \rangle}{\sigma_x^2} . $$


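With $b_1$ now known, we can compute SSA (recalling that $\bar{y} = b_0 = 0$ for now):


$$ \mathrm{SSA} = \sum_{i=1}^{n} \left( y_{mod,i} - \bar{y} \right)^2 = \sum_{i=1}^{n} (b_1 x_i)^2 = b_1^2 \sum_{i=1}^{n} x_i^2 = \left( \frac{\langle xy \rangle}{\sigma_x^2} \right)^2 n\,\sigma_x^2 = n\,\frac{\langle xy \rangle^2}{\sigma_x^2} . $$

SST is, with $\bar{y} = 0$:

$$ \mathrm{SST} = \sum_{i=1}^{n} y_i^2 = n\,\sigma_y^2 . $$

Then:

$$ r^2 = \frac{\mathrm{SSA}}{\mathrm{SST}} = \frac{n \langle xy \rangle^2 / \sigma_x^2}{n\,\sigma_y^2} = \frac{\langle xy \rangle^2}{\sigma_x^2\,\sigma_y^2} = \left( \frac{\langle xy \rangle}{\sigma_x \sigma_y} \right)^2 . $$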


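Since $\bar{y}$ was known exactly, and not estimated from the sample, SST has n dof.


To generalize to nonzero $\bar{x}$ and $\bar{y}$, we note that we can transform $x \to x - \bar{x}$, and $y \to y - \bar{y}$. These are
simple shifts in (x, y) position, and have no effect on the fit line slope or the residuals. These new random
variables are zero-mean, so our simplified derivation applies, with one small change: $\bar{y}$ is estimated from
the sample, so that removes 1 dof from SST: SST has n − 1 dof. Then:


$$ r^2 = \frac{\mathrm{SSA}}{\mathrm{SST}} = \frac{\left\langle (x - \bar{x})(y - \bar{y}) \right\rangle^2}{\sigma_x^2\,\sigma_y^2} $$

$$ \text{where} \qquad \left\langle (x - \bar{x})(y - \bar{y}) \right\rangle \equiv \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad \sigma_x^2 \equiv \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad \sigma_y^2 \text{ similar} . $$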


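Note that another common notation is:

$$ r^2 = \frac{S_{xy}^2}{S_{xx}\,S_{yy}}, \qquad \text{where } S_{xy} \equiv \left\langle (x - \bar{x})(y - \bar{y}) \right\rangle, \quad S_{xx} \equiv \sigma_x^2 = \left\langle (x - \bar{x})^2 \right\rangle, \quad S_{yy} \text{ similar} . $$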
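As a check that the two routes to the sample coefficient of determination agree, the sketch below computes it once from the moments $S_{xy}$, $S_{xx}$, $S_{yy}$ and once as SSA/SST from an explicit least-squares line. The data values are hypothetical.

```python
import math

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.3, 2.1, 3.8, 3.9, 5.6, 5.2]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Route 1: moments about the averages.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x) / n
s_yy = sum((yi - y_bar) ** 2 for yi in y) / n
r2_moments = s_xy ** 2 / (s_xx * s_yy)

# Route 2: SSA / SST from the least-squares line y_mod = b0 + b1 x.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
ssa = sum((b0 + b1 * xi - y_bar) ** 2 for xi in x)
sst = sum((yi - y_bar) ** 2 for yi in y)
r2_fit = ssa / sst

# The sample correlation coefficient takes the sign of the slope.
r = math.copysign(math.sqrt(r2_fit), b1)
print(f"r^2 (moments) = {r2_moments:.6f}, r^2 (fit) = {r2_fit:.6f}, r = {r:.4f}")
```

Note that the 1/n factors cancel in the ratio, so using 1/(n − 1) consistently for all three moments would give the same $r^2$.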


Distribution of $r^2$: Similarly to what we have seen with testing other fit parameters, to test the hypothesis
that $r^2 > 0$, we first consider the distribution of $r^2$ in pure noise. For pure zero-mean gaussian noise, $r^2$ follows
a beta distribution with 1 and n − 1 degrees of freedom (dof) [ref ??]. We can use the usual one-sided test at
the $\alpha$ significance threshold: if


$$ r^2 > \text{critical\_value}\left[\text{beta}(1, n-1);\ \alpha\right], \qquad \text{with } p_{sig} \equiv 1 - \text{cdf}_{beta}(r^2) \qquad (\text{gaussian}) , \tag{8.9} $$


then we reject the null hypothesis $H_0$, and accept that $\rho^2$ is probably > 0, at the $p_{sig}$ level of significance.


However: 


The beta distribution is difficult to use, since it crams up near 1, and many computer
implementations are unstable in the critical region where we need it most [ref??]. Instead, we can
use an equivalent F test, which is easy to interpret, and numerically stable.


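Again applying our results from linear regression, we recall that:


$$ \underbrace{\mathrm{SST}}_{\text{dof}\,=\,n-1} = \underbrace{\mathrm{SSA}}_{\text{dof}\,=\,1} + \underbrace{\mathrm{SSE}}_{\text{dof}\,=\,n-2} \quad\Rightarrow\quad f \equiv \frac{\mathrm{SSA}/\text{dof}}{\mathrm{SSE}/\text{dof}} = \frac{r^2}{(1 - r^2)/(n - 2)} . $$


Then for pure noise, $f \approx 1$. If $f \gg 1$, then $\rho^2$ is probably > 0, with significance given by the standard 1-sided
F test ($\alpha$ is our threshold for rejecting $H_0$):


$$ f > \text{critical\_value}\left[F(1, n-2);\ \alpha\right], \qquad \text{with } p_{sig} \equiv 1 - \text{cdf}_F(f) . $$


Note that the significance $p_{sig}$ here is identical to the significance from the beta function (8.9), but using the
F distribution is usually a more accurate and easier way to compute it.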
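The conversion from $r^2$ to the F statistic is a one-liner; the sketch below (function name `f_from_r2` and the numbers are our own, for illustration) also confirms that it matches the ratio of SSA to SSE per degree of freedom.

```python
def f_from_r2(r2, n):
    """F statistic with (1, n - 2) dof from the sample coefficient of determination."""
    return r2 / ((1.0 - r2) / (n - 2))

# Hypothetical sums of squares from a straight-line fit to n = 12 points.
n = 12
sst, ssa = 40.0, 30.0
sse = sst - ssa                      # SSQ identity
r2 = ssa / sst

f_direct = (ssa / 1) / (sse / (n - 2))
f_via_r2 = f_from_r2(r2, n)
print(f"r^2 = {r2}, f (direct) = {f_direct}, f (from r^2) = {f_via_r2}")
```

The two agree because SSA = $r^2$ SST and SSE = $(1 - r^2)$ SST, so SST cancels in the ratio.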


Alternative interpretation of x and y: There is another way that $\rho^2$ can be used, depending on the
nature of your data. Instead of x being an independent variable and y being corresponding measured values,
it may be that both x and y are RVs, with some interdependence. Then, much like $\bar{y}$ is a population parameter
of a single random variable y, $\rho^2$ is a population parameter of two dependent random variables, x and y, and
their joint density function. Either way, we define the coefficient of determination in linear-fit terms, as a
ratio of population variances of a least-squares fit of y to x. (We ignore here the question of the dof in $\sigma_y^2$.)


4/27/2021 11:49AM — Copyright 2002-2021 Eric L. Michelsen. All rights reserved. 151 of 322 


elmichelsen.physics.ucsd.edu/ | Funky Mathematical Physics Concepts emichels at physics.ucsd.edu 


9 Uncertainty Weighted Linear Regression | 


(Needs a summary of common equations??) 


When taking data, our measurements often have varying uncertainty (heteroskedastic): some 
measurements are “better” than others. We can still find an average of a set of such measurements, but what 
is the best average, and what is its uncertainty? These questions extend to almost all of the statistics we’ve 
covered so far: sample average and variance, fitting, etc. If you have a set of measurements, but each 
measurement has a different uncertainty, how do you combine the measurements for the most reliable 
estimate of a parameter? Intuitively, estimates with smaller uncertainty should be given more weight than 
estimates with larger uncertainty. But exactly how much? 


Note that the uncertainty of a measurement may or may not be influenced by the value of the
measurement itself. For example, for many systems, larger measured values have larger uncertainty.
However, all of our results are independent of whether the uncertainty has a statistical dependence on the
measured value or not. All we need is the uncertainty; we don’t care how it comes about. (Many statistics
references are unclear on this point.)


Each topic in this section assumes you thoroughly understand the unweighted case described earlier, 
before delving into the weighted case. 


Throughout this section, we consider data triples of the form $(x_i, y_i, u_i)$, where $x_i$ are the independent
variables, $y_i$ are the measured variables, and $u_i$ are the 1σ uncertainties of each measurement. We define the
uncertainty as variations that cannot be modeled in detail, though their PDF or other statistics may be known.


Two examples of the failure of the “obvious” adjustments to formulas for uncertainty-weighted data are the
unbiased estimate of a population $\sigma^2$ from a sample (detailed below), and the (now deprecated) Lomb-Scargle
detection parameter.


Be Sure of Your Uncertainty 


We must carefully define what we mean by “uncertainty” $u_i$. Figure 9.1 depicts a typical measurement,
with two separate sources of noise: external ($u_{ext}$), and instrumental ($u_{inst}$). The model experiment could be
an astronomical one, spread over millions of light-years, or it could be a table top experiment. The external
noise might be background radiation, CMB, thermal noise, etc. The instrument noise is the inevitable
variation in any measurement system. One can often calibrate the instrument, and determine its uncertainty
$u_{inst}$. Sometimes, one can measure $u_{ext}$ as well. However, for purposes of this chapter, we define our
uncertainty $u_i$ as:


$u_i$ ≡ all of the noise outside of the desired signal, s(t).


Our results depend on this.


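[Figure 9.1: block diagram. The signal s(t) picks up external noise $u_{ext}(t)$, giving $s(t) + u_{ext}(t)$ at the instrument
input; the instrument adds its own noise $u_{inst}(t)$, giving $s(t) + u_{ext}(t) + u_{inst}(t)$ at its output.]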


Figure 9.1 A typical measurement includes two sources of noise. 


Average of Uncertainty Weighted Data 


We give the formula for the uncertainty-weighted average of a sample, and the uncertainty of that
average. Consider a sample of n uncertainty weighted measurements, say $(x_i, y_i, u_i)$, where $y_i$ is the
measurement, and $u_i$ is the 1σ uncertainty in $y_i$. How should we best estimate the population average from
this sample? If we assume the estimator is a weighted average (as opposed to RMS or something else), we
now show that we should weight each $y_i$ by $u_i^{-2}$. The general formula for a weighted average is:


$$ \bar{y} \equiv \langle y_i \rangle = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}, \qquad \text{where } w_i \text{ are to be determined} . \tag{9.1} $$


The variance (over an ensemble of samples) of this weighted average, where the weights are constants, is
(recall that uncorrelated variances add):


$$ \mathrm{var}(\bar{y}) = \frac{\sum_{i=1}^{n} w_i^2 u_i^2}{\left( \sum_{i=1}^{n} w_i \right)^2} . \tag{9.2} $$


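Note that because of the normalization factor in the denominator, both $\bar{y}$ and its variance are independent
of any multiplicative constant in the weights (scale invariance): e.g., doubling all the weights has no effect
on $\bar{y}$ or its variance. However, we want to choose the individual weights to give $\bar{y}$ the minimum variance
possible. Therefore the derivative of the above variance with respect to any such weight, $w_k$, is zero. Using
the quotient rule for derivatives:


$$ \frac{\partial\,\mathrm{var}(\bar{y})}{\partial w_k} = \frac{\left( \sum_i w_i \right)^2 \left( 2 w_k u_k^2 \right) - \left( \sum_i w_i^2 u_i^2 \right) 2 \left( \sum_i w_i \right)}{\left( \sum_i w_i \right)^4} = 0 \qquad \left[ d\!\left( \frac{U}{V} \right) = \frac{V\,dU - U\,dV}{V^2} \right] $$

$$ \Rightarrow\quad w_k u_k^2 = \frac{\sum_i w_i^2 u_i^2}{\sum_i w_i} = \text{a constant, independent of } k . $$


Since the weights are scale invariant, the only dependence that matters is that $w_k \propto u_k^{-2}$. Therefore, we take
the simplest form, and define:


$$ w_i \equiv u_i^{-2} \qquad (\text{raw weights}) . $$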


For a least-squares estimate of the population average, we weight each measurement 
by the inverse of the uncertainty squared (inverse of the measurement variance). 


Some references use scaled weights proportional to $u_i^{-2}$, so we allow for other weight scalings below.


As expected, large uncertainty points are weighted less than small uncertainty points. Our derivation 
applies to any measurement error distribution; in particular, errors need not be gaussian. The least-squares 
weighted average is well-known [Myers 1986 p171t]. [Note that we have not proved that a weighted average, 
in general, is necessarily the optimum form for estimating a population average, but it is [ref??]. (I suspect 
this can be proved with calculus of variations, but I’ve not seen it done.)] 


Given these optimum weights, we can now write the uncertainty of $\bar{y}$ more succinctly. For convenience,
we define:

$$ W \equiv \sum_{i=1}^{n} u_i^{-2}, \qquad V_1 \equiv \sum_{i=1}^{n} w_i \ \ (\text{a normalization factor}), \qquad V_2 \equiv \sum_{i=1}^{n} w_i^2 . $$


Note that W is defined to be independent of weight scaling, $V_1$ scales with the weights, and $V_2$ scales with the
square of the weights. Then from eq. (9.2), the variance of $\bar{y}$ is:


$$ \mathrm{var}(\bar{y}) = \frac{\sum_i w_i^2 u_i^2}{V_1^2} = \frac{V_1}{V_1^2} = \frac{1}{V_1} . \tag{9.3} $$

The variance must be scale invariant, but our result includes $V_1$, which scales. That’s because we chose a
scale when we used $u_i^2 = w_i^{-1}$, for which $V_1 = W$. We can infer the scale-invariant result as follows: W is
scale invariant, so by substituting $V_1 = W$, the scale invariant result is:


$$ \mathrm{var}(\bar{y}) = \frac{1}{W} \qquad \text{and} \qquad \text{uncertainty}(\bar{y}) \equiv \mathrm{dev}(\bar{y}) = \sqrt{\mathrm{var}(\bar{y})} = \frac{1}{\sqrt{W}} . $$


The raw weights, $w_i$, have units of [measurement]$^{-2}$.


Note that the weighted uncertainty of $\bar{y}$ reduces to the well-known unweighted uncertainty when all the
uncertainties are equal, say u:

$$ \mathrm{var}(\bar{y}) = \frac{1}{W} = \left( \sum_{i=1}^{n} u^{-2} \right)^{-1} = \frac{u^2}{n} \quad\Rightarrow\quad \mathrm{dev}(\bar{y}) = \frac{u}{\sqrt{n}} . $$
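A minimal sketch of the weighted average and its uncertainty, using the raw weights $w_i = u_i^{-2}$. The function name `weighted_average` and the measurement values are hypothetical.

```python
import math

def weighted_average(y, u):
    """Return (average, 1-sigma uncertainty) using raw weights w_i = u_i**-2."""
    w = [ui ** -2 for ui in u]
    big_w = sum(w)                                   # W = sum of u_i^-2
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / big_w
    return y_bar, 1.0 / math.sqrt(big_w)

# Hypothetical measurements of one quantity, with different 1-sigma uncertainties.
y = [10.2, 9.8, 10.5]
u = [0.1, 0.2, 0.4]
avg, dev = weighted_average(y, u)
print(f"weighted average = {avg:.4f} +/- {dev:.4f}")

# With equal uncertainties, this reduces to the plain average and u / sqrt(n).
avg_eq, dev_eq = weighted_average([1.0, 2.0, 3.0, 6.0], [0.5] * 4)
print(avg_eq, dev_eq)
```

The combined uncertainty is always smaller than the smallest individual uncertainty, since every measurement adds to W.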


Variance and Standard Deviation of Uncertainty Weighted Data 


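Handy numerical identity: Recall that when computing unweighted standard deviations, we simplify
the calculation using the handy identity:


$$ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\,\bar{y}^2 \qquad \text{or} \qquad \sum_{i=1}^{n} y_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} y_i \right)^2 . $$

What is the equivalent identity for weighted sums of squared deviations? We derive it here:

$$ \begin{aligned}
\sum_{i=1}^{n} w_i (y_i - \bar{y})^2 &= \sum_{i=1}^{n} w_i \left( y_i^2 - 2 y_i \bar{y} + \bar{y}^2 \right) \qquad \left[ \text{Use: } \sum_{i=1}^{n} w_i y_i = V_1 \bar{y} \right] \\
&= \sum_{i=1}^{n} w_i y_i^2 - 2 V_1 \bar{y}^2 + V_1 \bar{y}^2 \\
&= \sum_{i=1}^{n} w_i y_i^2 - V_1 \bar{y}^2 \qquad \text{or} \qquad \sum_{i=1}^{n} w_i y_i^2 - \frac{1}{V_1} \left( \sum_{i=1}^{n} w_i y_i \right)^2 .
\end{aligned} \tag{9.4} $$


We note a general pattern that in going from an unweighted formula to the equivalent weighted formula, the
number n is often replaced by the number $V_1$, and all the summations include the weights.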
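The weighted identity is easy to verify numerically. In the sketch below the data and weights are hypothetical; all three forms of the weighted sum of squared deviations should agree to rounding error.

```python
# Hypothetical data and weights, for illustration only.
y = [10.2, 9.8, 10.5, 10.1]
w = [100.0, 25.0, 6.25, 50.0]

v1 = sum(w)
s_wy = sum(wi * yi for wi, yi in zip(w, y))
s_wyy = sum(wi * yi * yi for wi, yi in zip(w, y))
y_bar = s_wy / v1                                   # weighted average

direct = sum(wi * (yi - y_bar) ** 2 for wi, yi in zip(w, y))
shortcut_a = s_wyy - v1 * y_bar ** 2                # sum(w y^2) - V1 y_bar^2
shortcut_b = s_wyy - s_wy ** 2 / v1                 # sum(w y^2) - (sum(w y))^2 / V1

print(direct, shortcut_a, shortcut_b)
```

The shortcut forms need only running sums, which is convenient in code, but they subtract two nearly equal numbers when the spread of the data is small compared to its average, so the direct form can be more accurate in floating point.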


Weighted sample variance: We now find an unbiased weighted sample variance; unbiased means that
over many samples (sets of individual values), the sample variance averages to the population variance. We
first state the result:


$$ s^2 \equiv \frac{\sum_{i=1}^{n} w_i (y_i - \bar{y})^2}{V_1 - V_2/V_1} . $$

We prove below that this is an unbiased estimator.


Many references give incorrect formulas for the weighted sample variance;
in particular, it is not just $(1/V_1) \sum_i w_i (y_i - \bar{y})^2$.


For computer code, we often use the weighted sum-of-squared deviations identity (9.4) to simplify the
calculation:


$$ s^2 = \frac{\sum_{i=1}^{n} w_i (y_i - \bar{y})^2}{V_1 - V_2/V_1} = \frac{\sum_{i=1}^{n} w_i y_i^2 - \left( \sum_{i=1}^{n} w_i y_i \right)^2 / V_1}{V_1 - V_2/V_1} \qquad \text{or} \qquad \frac{V_1 \sum_{i=1}^{n} w_i y_i^2 - \left( \sum_{i=1}^{n} w_i y_i \right)^2}{V_1^2 - V_2} . $$


We now prove that over many sample sets, the statistic $s^2$ averages to the true population $\sigma^2$. (We use
our statistical algebra.) Without loss of generality, we take the population average to be zero, because we
can always shift a random variable by a constant amount to make its (weighted) average zero, without
affecting its variance. Then the population variance becomes:


$$ \sigma^2 = \langle Y^2 \rangle - \langle Y \rangle^2 \quad\Rightarrow\quad \sigma^2 = \langle Y^2 \rangle . $$


We start by guessing (as we did for unweighted data) that the weighted average squared-deviation is an
estimate for $\sigma^2$. For a single given sample set, the simple weighted average of the squared-deviations from
$\bar{y}$ is (again using (9.4)):


$$ \frac{\sum_{i=1}^{n} w_i (y_i - \bar{y})^2}{\sum_{i=1}^{n} w_i} = \frac{\sum_{i=1}^{n} w_i y_i^2}{V_1} - \bar{y}^2 . \tag{9.5} $$

Is this unbiased? To see, we average over the ensemble of all possible sample sets (using the same weights).
I.e., the weights, and therefore $V_1$ and $V_2$, are constant over the ensemble average. The first term in (9.5)
averages to:


$$ \left\langle \frac{\sum_{i=1}^{n} w_i y_i^2}{V_1} \right\rangle = \frac{\sum_{i=1}^{n} w_i \langle y_i^2 \rangle}{V_1} = \langle Y^2 \rangle = \sigma^2 . $$


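By separating the squared terms from the cross terms, we find the second term in (9.5) averages to:


$$ \langle \bar{y}^2 \rangle = \left\langle \left( \frac{\sum_{i=1}^{n} w_i y_i}{V_1} \right)^2 \right\rangle = \frac{1}{V_1^2} \left\langle \sum_{i=1}^{n} w_i^2 y_i^2 + \sum_{i \ne j} w_i w_j\, y_i y_j \right\rangle . $$


Recall that the covariance, or equivalently the correlation coefficient, between any two independent random
variables is zero. The last term is proportional to $\langle y_i y_j \rangle$ (and therefore the covariance), which is zero for
the independent values $y_i$ and $y_j$. Thus:


$$ \langle \bar{y}^2 \rangle = \frac{V_2}{V_1^2}\,\sigma^2 \quad\Rightarrow\quad \left\langle \frac{\sum_{i=1}^{n} w_i (y_i - \bar{y})^2}{V_1} \right\rangle = \left( 1 - \frac{V_2}{V_1^2} \right) \sigma^2 . $$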


Finally, the unbiased estimate of σ² simply divides out the prefactor: 

\[ s^2 = \frac{1}{1 - V_2/V_1^2} \, \frac{\sum_{i=1}^{n} w_i \left( y_i - \bar{y} \right)^2}{V_1}
   \quad\Rightarrow\quad
   s^2 = \frac{\sum_{i=1}^{n} w_i \left( y_i - \bar{y} \right)^2}{V_1 - V_2/V_1} , \qquad (9.6) \]

as above. Note that we have shown that s² is unbiased, but we have not shown that s² is the least-squares 
estimator, nor that it is the best (minimum variance) unbiased estimator. But it is [ref??]. 


Also, as always, the sample standard deviation s ≡ √(s²) is biased, because the square root of an average 
is not the average of the square roots. Since we are concerned most often with bias in the variance, and rarely 
with bias in the standard deviation, we don’t bother looking for an unbiased estimator for σ, the population 
standard deviation. 


Distribution of weighted s²: If the uncertainties are exact, and the residuals gaussian, s² is (up to a constant 
scale factor) χ²(n−1) distributed. Realistically, the uncertainties are only estimates, so s² is not exactly χ² 
distributed, and therefore we cannot rigorously associate degrees of freedom with it. However, for large n, 
and/or good uncertainties u_i, we can approximate the weighted s² as having a χ²(n−1) distribution (like the 
unweighted s² does). 


Normalized weights 
Some references normalize the weights so that they sum to 1, in which case they are dimensionless: 


\[ W \equiv \sum_{i=1}^{n} u_i^{-2} , \quad\text{and}\quad w_i \equiv \frac{u_i^{-2}}{W} \qquad \text{(normalized, dimensionless weights)} . \]


This makes V_1 = 1 (dimensionless), and therefore V_1 does not appear in any formulas. (V_2 must still be 
computed from the normalized weights.) Both normalizations are found in the literature, so it is helpful to 
be able to switch between the two. 


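As an example of how formulas are changed, consider a chi-squared goodness-of-fit parameter. Its form 
is, in both raw and normalized weights: 

\[ \text{(raw)} \quad \chi^2 = \sum_{i=1}^{n} w_i \left( y_i - y_{mod,i} \right)^2
   \quad\Rightarrow\quad
   \chi^2 = W \sum_{i=1}^{n} w_i \left( y_i - y_{mod,i} \right)^2 \quad \text{(normalized)} . \]


Other similar modifications appear in other formulas. In general, we can say: 

\[ \text{to go from raw to normalized:} \quad w_i^{raw} \to W w_i^{norm}, \quad V_1^{raw} \to W, \quad V_2^{raw} \to W^2 V_2^{norm} ; \]
\[ \text{to go from normalized to raw:} \quad w_i^{norm} \to \frac{w_i^{raw}}{V_1^{raw}}, \quad V_2^{norm} \to \frac{V_2^{raw}}{\left( V_1^{raw} \right)^2} . \]


As another example, we transform the raw formula for s², eq. (9.6), to normalized: 

\[ \text{(raw)} \quad s^2 = \frac{\sum_{i=1}^{n} w_i \left( y_i - \bar{y} \right)^2}{V_1 - V_2/V_1}
   \quad\Rightarrow\quad
   s^2 = \frac{W \sum_{i=1}^{n} w_i \left( y_i - \bar{y} \right)^2}{W - W^2 V_2 / W}
       = \frac{\sum_{i=1}^{n} w_i \left( y_i - \bar{y} \right)^2}{1 - V_2} \quad \text{(normalized)} . \]


To go back (from the normalized s² to raw), we take W → V_1 (if W were there), w_i → w_i/V_1, and V_2 → V_2/V_1². 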


For now, the raw, dimensionful weights give us a handy check of units for our formulas, so we continue 
to use them in most places. 


Numerically Convenient Weights 


It is often convenient to perform preliminary calculations by ignoring the measurement uncertainties u_i, 
and using unweighted formulas. We might even do such estimates mentally. Later, more accurate 
calculations may be done which include the uncertainties. It is then useful to compare the preliminary 
unweighted values with the weighted values, especially for intermediate steps in the analysis, e.g. during 
debugging of analysis code. However, unnormalized weights, w_i = u_i⁻², have arbitrary magnitudes that lead 
to intermediate values with no simple interpretation, and that are not directly comparable to the unweighted 
estimates. Therefore, we scale the weights so that intermediate results have the same 
scale as unweighted results. The unweighted case is equivalent to all weights being 1, with a sum of n. We 
can scale our uncertainty weights to the same sum, i.e. n, or equivalently, we scale our weights to an average 
of 1: 

\[ \sum_{i=1}^{n} 1 = n \ \ \text{(unweighted)} \quad\Rightarrow\quad \text{(weighted)} \ \ \sum_{i=1}^{n} w_i = n , \quad\text{and therefore}\quad w_i = \frac{n}{W}\, u_i^{-2} , \quad W \equiv \sum_{i=1}^{n} u_i^{-2} . \]


With this weight scaling, “quick and dirty” calculations are easily compared to more accurate fully-weighted 
intermediate (debug) results. 


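In general, for arbitrary weight scaling, we have: 

\[ w_i^{new} = \frac{W^{new}}{W}\, u_i^{-2} = \frac{W^{new}}{V_1^{old}}\, w_i^{old} , \quad\text{where}\quad
   W \equiv \sum_{i=1}^{n} u_i^{-2} , \quad W^{new} \equiv \sum_{i=1}^{n} w_i^{new} . \]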
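The weight scalings above (raw, normalized to sum 1, and scaled to sum n) differ only by a common factor. A minimal sketch, assuming NumPy; the helper name `scaled_weights` and its `target_sum` parameter are hypothetical:

```python
import numpy as np

def scaled_weights(u, target_sum=None):
    """Scale the raw weights u_i^-2 so that they sum to target_sum.

    target_sum=None gives a sum of n (average weight 1);
    target_sum=1.0 gives normalized, dimensionless weights.
    """
    u = np.asarray(u, dtype=float)
    raw = u**-2.0                      # raw, dimensionful weights
    if target_sum is None:
        target_sum = len(u)
    return raw * (target_sum / raw.sum())

u = np.array([0.1, 0.2, 0.3])
w_n = scaled_weights(u)                # sums to n = 3
w_1 = scaled_weights(u, 1.0)           # sums to 1
print(w_n.sum(), w_1.sum())
```

Because every scaling multiplies all weights by the same constant, the ratios between weights are unchanged, and so is any scale-free statistic such as the weighted average.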


Uncertainty Weighted Straight-Line Fit 


The common case of an uncertainty-weighted straight-line fit is worth noting for reference [Strutz 7.16 
p177]: 

\[ y_{mod}(x) = b_0 + b_1 x , \]

\[ V_1 \equiv \sum_{i=1}^{n} w_i , \quad S_x \equiv \sum_{i=1}^{n} w_i x_i , \quad S_y \equiv \sum_{i=1}^{n} w_i y_i , \quad
   S_{xx} \equiv \sum_{i=1}^{n} w_i x_i^2 , \quad S_{xy} \equiv \sum_{i=1}^{n} w_i x_i y_i , \]

\[ b_0 = \frac{S_{xx} S_y - S_x S_{xy}}{V_1 S_{xx} - S_x^2} , \qquad
   b_1 = \frac{V_1 S_{xy} - S_x S_y}{V_1 S_{xx} - S_x^2} . \]


These are scale-free (i.e., they work with all weight scalings), because all terms scale as the square of the 
weights, so the scaling cancels. 
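The straight-line formulas translate directly into code. A minimal sketch, assuming NumPy and weights w_i = u_i⁻²; the function name `weighted_line_fit` and its `scale` parameter (used only to demonstrate that the result is scale-free) are hypothetical:

```python
import numpy as np

def weighted_line_fit(x, y, u, scale=1.0):
    """Fit y = b0 + b1*x with weights w_i = scale / u_i^2; return (b0, b1)."""
    x, y, u = (np.asarray(a, dtype=float) for a in (x, y, u))
    w = scale * u**-2.0
    V1, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
    Sxx, Sxy = (w * x * x).sum(), (w * x * y).sum()
    D = V1 * Sxx - Sx**2               # common denominator
    b0 = (Sxx * Sy - Sx * Sxy) / D
    b1 = (V1 * Sxy - Sx * Sy) / D
    return b0, b1

x = np.array([0.0, 1.0, 2.0, 3.0])
u = np.array([0.1, 0.2, 0.1, 0.4])
y = np.array([2.1, 4.8, 8.2, 10.5])
print(weighted_line_fit(x, y, u))
print(weighted_line_fit(x, y, u, scale=7.0))   # same result for any weight scaling
```

As a cross-check, `numpy.polyfit(x, y, 1, w=1/u)` should return the same coefficients in the order (b1, b0), since its weights multiply the unsquared residuals.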


Transformation to Equivalent Homoskedastic Measurements 


We expect that the homoskedastic case (all measurements have the same uncertainty, σ) is simpler, and 
possibly more powerful than the heteroskedastic case (each measurement has its own uncertainty, u_i). 
Furthermore, many computer regression libraries cannot handle heteroskedastic data(!). Fortunately, for the 
purpose of linear regression, there is a simple transformation from heteroskedastic measurements to an 
equivalent set of homoskedastic measurements. This not only provides theoretical insight, but is very useful 
in practice: it allows us to use many (but not all) of the homoskedastic libraries by transforming to the 
equivalent homoskedastic measurements, and operating on the transformed data. 


To perform the transformation, we choose an arbitrary uncertainty to act as our new, equivalent 
homoskedastic uncertainty σ. As a convenient choice, we might choose the smallest of all the measurement 
uncertainties u_min to be our equivalent homoskedastic uncertainty σ, or perhaps the RMS(u_i). (Recall that u_i 
is defined as all of the measurement error, both internal and external.) Then we define a new set of equivalent 
“measurements” (x_i, y_i, u_i) → (x′_i, y′_i, σ) according to: 

\[ x'_{mi} \equiv \frac{\sigma}{u_i}\, x_{mi} , \qquad y'_i \equiv \frac{\sigma}{u_i}\, y_i . \]

We can now use all of the homoskedastic procedures and calculations for linear regression on the new, 
equivalent “measurements.” Note that we have scaled both the predictors x_mi, and the measurements y_i, by 
the ratio of our chosen σ to the original uncertainty u_i. Measurements with smaller uncertainties than σ get 
scaled “up” (bigger), and measurements with larger uncertainties than σ get scaled “down” (smaller). 


If the original noise added into each sample was independent (as we usually assume), then multiplying 
the y; by constants also yields independent noise samples, so the property of independent noise is preserved 
in the transformation. 


Figure 9.2 shows an example transformation graphically, and helps us understand why it works. 
Consider 3 heteroskedastic measurements: 


(1.0, 0.5, 0.1), (1.6, 0.8, 0.2), (2.0, 1.0, 0.3) (original measurements). 


We choose our worst uncertainty, 0.3, as our equivalent homoskedastic σ. Then our equivalent measurements 
become: 


(3.0, 1.5, 0.3), (2.4, 1.2, 0.3), (2.0, 1.0, 0.3) (equivalent measurements). 
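The example numbers can be reproduced, and the equivalence of the two fits checked, with a few lines of code. This is a minimal sketch, assuming NumPy; the perturbed values `y_noisy` are hypothetical, introduced only so that the fit is not exact:

```python
import numpy as np

# The three measurements (x_i, y_i, u_i) from the example above.
x = np.array([1.0, 1.6, 2.0])
y = np.array([0.5, 0.8, 1.0])
u = np.array([0.1, 0.2, 0.3])
sigma = u.max()                        # chosen equivalent uncertainty, 0.3

t = sigma / u                          # per-measurement scale factor sigma/u_i
x_eq, y_eq = t * x, t * y              # equivalent homoskedastic measurements
print(x_eq, y_eq)

# Fit y = b0 + b1*x two ways, using hypothetical perturbed y values.
y_noisy = y + np.array([0.02, -0.05, 0.04])
F = np.column_stack([np.ones_like(x), x])      # predictors: constant and x
W = np.diag(u**-2.0)

# (1) Weighted (least-chi-squared) fit on the original data.
b_weighted = np.linalg.solve(F.T @ W @ F, F.T @ W @ y_noisy)

# (2) Ordinary least squares on the transformed data. Every predictor
#     column is scaled, including the constant one, as is y.
b_equiv, *_ = np.linalg.lstsq(t[:, None] * F, t * y_noisy, rcond=None)
print(b_weighted, b_equiv)
```

Note that the constant predictor is no longer constant after the transformation, so a library that silently adds its own intercept column would not give the equivalent fit.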


Figure 9.2 illustrates that an uncertainty of 0.3 at x′_1 = 3.0 is equivalent to an uncertainty of 0.1 at x_1 = 1.0, 
because the x′ point “tugs on” the slope of the line with the same contribution to χ², the square of (y_mod,i − 
y_i)/u_i. In terms of sums of squares, the transformation equates every term of the sum: 

\[ \frac{y_{mod}(x_i) - y_i}{u_i} = \frac{y_{mod}(x'_i) - y'_i}{\sigma}
   \quad\Rightarrow\quad
   \sum_{i=1}^{n} \left( \frac{y_{mod}(x_i) - y_i}{u_i} \right)^2 = \sum_{i=1}^{n} \left( \frac{y_{mod}(x'_i) - y'_i}{\sigma} \right)^2 . \]

The transformation coefficients are dimensionless, so the units of the transformed quantities are the same as 
the originals. Note that: 


The regression coefficients b_k, and their covariances, are unchanged by the transformation to 
equivalent homoskedastic measurements, but the model values y′_mod,i ≡ y_mod(x′_i) change 
because the predictors x′_i are transformed from the original x_i. 


Equivalently, the predictions of the transformed model are different than the predictions of the original model. 
The uncertainties in the b_k are given by the standard homoskedastic formulas with σ as the measurement 
uncertainties, and the covariance matrix var(b) is also preserved by the transformation. These considerations 
show that SST, SSA, and SSE are not preserved in the transformation. 


Figure 9.2 The model y_mod vs. the original and the equivalent homoskedastic measurements. 


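In matrix form, the transformation is: 

\[ \mathbf{T} \equiv \sigma\, \mathrm{diag}\!\left( \frac{1}{u_1}, \frac{1}{u_2}, \ldots, \frac{1}{u_n} \right) , \qquad
   \mathbf{x}'_m = \mathbf{T}\mathbf{x}_m , \qquad \mathbf{y}' = \mathbf{T}\mathbf{y} . \]


To illustrate the equivalence, recall that the standard sample average is a linear fit to a constant function 
f_0(t) = 1. Therefore, the weighted sample average is given by the unweighted fit of the transformed 
measurements to the transformed function f′_0. Proof: the unweighted fit coefficient is 
Σ f′_0(t_i) y′_i / Σ f′_0(t_i)² = Σ (σ/u_i)² y_i / Σ (σ/u_i)² = Σ w_i y_i / Σ w_i. Note that the transformed 
function, f′_0(t_i) = σ/u_i, is not constant. 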


In contrast, note that the heteroskedastic population variance estimate (9.6) 
is not a linear fit. That’s why it requires this odd-looking formula, and is not given by the common 
homoskedastic variance estimate applied to the transformed data: s² ≠ Σ (y′_i − ȳ′)² / (n − 1). 


As another counter-example, the standard Lomb-Scargle algorithm (now deprecated) doesn’t work on 
transformed data. Although Lomb-Scargle is essentially a simultaneous linear fit to a cosine and a sine, it 
relies on a nonlinear computation of the orthogonalizing time offset, τ; essentially, it relies on the fit-functions 
being cos and sin, and satisfying trigonometric identities. These properties do not hold for the transformed 
data. 


Orthogonality is preserved: If two predictors are orthogonal w.r.t. the weights, then the transformed 
predictors are also orthogonal. Proof: with predictors x_ki and x_mi: 

\[ \sum_{i=1}^{n} w_i x_{ki} x_{mi} = 0 \quad\Rightarrow\quad
   \sum_{i=1}^{n} x'_{ki} x'_{mi} = \sum_{i=1}^{n} \frac{\sigma^2}{u_i^2}\, x_{ki} x_{mi} = \sigma^2 \sum_{i=1}^{n} w_i x_{ki} x_{mi} = 0 . \]


Linear Regression with Individual Uncertainties 


We have seen that for data with constant uncertainties, we fit it to a model using the criterion of least- 
squared residual. If instead we have individual uncertainties (y_i, u_i), we commonly use a least-chi-squared 
criterion. That is, we fit the model coefficients (b_k) to minimize: 

\[ SSE \equiv \chi^2 = \sum_{i=1}^{n} \frac{\varepsilon_i^2}{u_i^2} , \quad\text{where}\quad \varepsilon_i \equiv \text{residual} = y_i - y_{mod,i} . \]

For gaussian residuals, least-chi-squared fits yield maximum likelihood fit coefficients. For non-gaussian 
residuals, least-chi-squared is usually as good a criterion as any. 


However, there are many statistical formulas that need updating for uncertainty-weighted data. Often, 
we need an exact closed-form formula for a weighted-data statistical parameter. For example, computing an 
iterative approximate fit to data can be prohibitively slow, but a closed-form formula may be acceptable (e.g., 
periodograms). Finding such exact formulas in the literature is surprisingly hard. 


We discuss and analyze some direct weighted-regression computations here. As in the earlier unweighted 
analysis, we clearly identify the scope of applicability for each formula. And as always, understanding the 
methods of analyzing and deriving these statistics is essential to developing your own methods for processing 
new situations. 


This section assumes a thorough understanding of the similar unweighted sections. Many of our 
derivations follow the unweighted ones, but may be briefer here. 


The first step of linear regression with individual uncertainties is summarized in [Bev p117-118], oddly 
in the chapter “Least-Squares Fit to a Polynomial,” even though it applies to all fit functions (not just 
polynomials). We summarize here the results. The linear model is the same as the unweighted case: given 
p functions we wish to fit to n data points, the model is: 


\[ y_{mod}(x) = \sum_{k=1}^{p} b_k f_k(x) = b_1 f_1(x) + b_2 f_2(x) + \ldots + b_p f_p(x) \qquad \text{[Bev 7.3 p117]} . \]


Each measurement is a triple of independent variable, dependent variable, and measurement uncertainty, (x_i, 
y_i, u_i). As before, the predictors do not have to be functions of an independent variable (and in ANOVA, 
they are not); we use such functions only to simplify the presentation. We find the b_k by minimizing the χ² 
parameter: 

\[ \chi^2 = \sum_{i=1}^{n} \frac{\varepsilon_i^2}{u_i^2}
          = \sum_{i=1}^{n} \frac{\left( y_i - \sum_{m=1}^{p} b_m f_m(x_i) \right)^2}{u_i^2} \qquad \text{[Bev 7.5 p117]} . \]


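For each k from 1 to p, we set the partial derivative, ∂χ²/∂b_k = 0, to get a set of simultaneous linear equations 
in the b_k: 

\[ \frac{\partial \chi^2}{\partial b_k} = 0 = -2 \sum_{i=1}^{n} \frac{\left( y_i - \sum_{m=1}^{p} b_m f_m(x_i) \right) f_k(x_i)}{u_i^2} , \qquad k = 1, 2, \ldots p . \]

Dividing out the −2, and simplifying: 

\[ 0 = \sum_{i=1}^{n} \frac{\left[ y_i - \sum_{m=1}^{p} b_m f_m(x_i) \right] f_k(x_i)}{u_i^2} , \qquad k = 1, 2, \ldots p . \]


Moving the constants to the LHS, we get a linear system of equations in the sought-after b_k: 

\[ \sum_{i=1}^{n} \frac{y_i f_k(x_i)}{u_i^2}
   = \sum_{i=1}^{n} \sum_{m=1}^{p} \frac{b_m f_m(x_i) f_k(x_i)}{u_i^2}
   = \sum_{m=1}^{p} b_m \sum_{i=1}^{n} \frac{f_m(x_i) f_k(x_i)}{u_i^2} , \qquad k = 1, 2, \ldots p . \]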
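The linear system above can be assembled and solved directly. A minimal sketch, assuming NumPy; the function name `weighted_linear_fit` and the example fit functions are hypothetical:

```python
import numpy as np

def weighted_linear_fit(fit_funcs, x, y, u):
    """Solve the least-chi-squared normal equations for the coefficients b_k."""
    x, y, u = (np.asarray(a, dtype=float) for a in (x, y, u))
    w = u**-2.0
    F = np.column_stack([f(x) for f in fit_funcs])   # F[i, m] = f_m(x_i)
    A = F.T @ (w[:, None] * F)      # A[k, m] = sum_i f_k(x_i) f_m(x_i) / u_i^2
    beta = F.T @ (w * y)            # beta[k] = sum_i y_i f_k(x_i) / u_i^2
    return np.linalg.solve(A, beta)

# Example: constant, sine, and cosine fit functions (not polynomials).
funcs = [lambda x: np.ones_like(x), np.sin, np.cos]
x = np.linspace(0.0, 5.0, 20)
u = 0.1 + 0.05 * x
y = 1.5 + 2.0 * np.sin(x) - 0.5 * np.cos(x)          # noise-free test data
print(weighted_linear_fit(funcs, x, y, u))
```

For badly conditioned fit functions, solving the normal equations loses accuracy; an alternative is an ordinary least-squares solver (e.g. `numpy.linalg.lstsq`) applied to the rows of F and y scaled by 1/u_i, which is the homoskedastic transformation described earlier with σ = 1.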


Linear Regression With Uncertainties and the Sum-of-Squares Identity 


As with unweighted data, the weighted sum-of-squares (SSQ) identity is the crucial underpinning of 
weighted linear regression (aka “generalized linear regression”). Before considering uncertainties, recall 
our unweighted sum-of-squares identity in vector form: 

\[ \text{(raw, unweighted)} \quad SST = SSA + SSE \quad\text{or}\quad \mathbf{y}^2 = \mathbf{y}_{mod}^2 + \boldsymbol{\varepsilon}^2 , \qquad (9.7) \]
\[ \text{where}\quad \boldsymbol{\varepsilon} \equiv \text{residual vector} , \quad \mathbf{y}^2 \equiv \mathbf{y} \cdot \mathbf{y} , \ \text{etc.} , \quad
   \mathbf{y}_{mod} \equiv \sum_{k=1}^{p} b_k \mathbf{f}_k . \]


Recall that the dot products are real numbers. Also, by construction, ε is orthogonal to f_k, ε·f_k = 0, and the 
SSQ identity hinges on this. 


We derive the weighted theory almost identically to the unweighted case. All of our vectors remain the 
same as before, and we need only redefine our dot product. The weighted dot-product weights each term in 
the sum by w_i: 

\[ \mathbf{a} \cdot \mathbf{b} \equiv \sum_{i=1}^{n} w_i a_i b_i , \quad w_i \equiv u_i^{-2} , \quad \mathbf{a}^2 \equiv \mathbf{a} \cdot \mathbf{a} \qquad \text{(weighted dot-product)} . \]


Such generalized inner products are common in mathematics and science. They retain all the familiar, useful 
properties; in particular, they are bilinear, and in this case, commutative. Then the weighted SSQ identity 
has exactly the same form as the unweighted case (proved shortly): 

\[ \text{(raw)} \quad SST = SSA + SSE : \quad \mathbf{y}^2 = \mathbf{y}_{mod}^2 + \boldsymbol{\varepsilon}^2 . \qquad (9.8) \]


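Note that SSE is the χ² parameter we minimize when fitting. Written explicitly as summations, the weighted 
SSQ identity is: 

\[ \text{(raw)} \quad
   \underbrace{\sum_{i=1}^{n} w_i y_i^2}_{SST}
   = \underbrace{\sum_{i=1}^{n} w_i \left( \sum_{m=1}^{p} b_m f_m(x_i) \right)^2}_{SSA}
   + \underbrace{\sum_{i=1}^{n} w_i \left( y_i - \sum_{m=1}^{p} b_m f_m(x_i) \right)^2}_{SSE}
   \qquad \text{[Schwa 1998, eq 4 p832]} . \]


If this identity still holds in the weighted case, then most of our previous (unweighted) work remains valid. 
We now show that it does hold. We start by noting that even in the weighted case, ε·f_k = 0. The proof comes 
from the fact that SSE is a minimum w.r.t. all the b_k: 

\[ \frac{\partial\, SSE}{\partial b_k} = 0 = \frac{\partial}{\partial b_k} \sum_{i=1}^{n} w_i \varepsilon_i^2
   = 2 \sum_{i=1}^{n} w_i \varepsilon_i \frac{\partial}{\partial b_k} \left( y_i - \sum_{m=1}^{p} b_m f_{mi} \right) . \]


The only term surviving the derivative is when m = k: 

\[ 0 = \sum_{i=1}^{n} w_i \varepsilon_i f_{ki} = \boldsymbol{\varepsilon} \cdot \mathbf{f}_k , \qquad k = 1, \ldots p . \]


Therefore, per (8.3), the weighted raw sum-of-squares identity holds. 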


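The identity is a little simpler in terms of y_mod: 

\[ y_{mod,i} \equiv \sum_{m=1}^{p} b_m f_{mi} \quad\Rightarrow\quad
   \text{(raw)} \quad \sum_{i=1}^{n} w_i y_i^2 = \sum_{i=1}^{n} w_i y_{mod,i}^2 + \sum_{i=1}^{n} w_i \varepsilon_i^2 . \]


Also as before, if we include a constant b_0 fit parameter, then the ANOVA SSQ identity holds: 

\[ \text{ANOVA:} \quad \sum_{i=1}^{n} w_i \left( y_i - \bar{y} \right)^2 = \sum_{i=1}^{n} w_i \left( y_{mod,i} - \bar{y} \right)^2 + \sum_{i=1}^{n} w_i \varepsilon_i^2 . \]


Recall that ȳ is the weighted average (9.1). With a b_0 fit parameter, we have: 

\[ \boldsymbol{\varepsilon} \cdot \mathbf{f}_0 = 0 = \sum_{i=1}^{n} w_i \varepsilon_i \quad\text{where}\quad f_{0i} = 1 \ \ \forall i . \]


Thus the weighted sum of residuals is zero. Then the ANOVA SSQ proof is only slightly more involved 
than the unweighted case: 

\[ \sum_{i=1}^{n} w_i \left( y_i - \bar{y} \right)^2 = \sum_{i=1}^{n} w_i \left( \left( y_{mod,i} - \bar{y} \right) + \varepsilon_i \right)^2 \]
\[ \qquad = \sum_{i=1}^{n} w_i \left( y_{mod,i} - \bar{y} \right)^2 + \sum_{i=1}^{n} w_i \varepsilon_i^2
   + \underbrace{2 \sum_{i=1}^{n} w_i y_{mod,i}\, \varepsilon_i}_{0} - \underbrace{2 \bar{y} \sum_{i=1}^{n} w_i \varepsilon_i}_{0} . \]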
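Both the raw and the ANOVA weighted SSQ identities are easy to confirm numerically for any least-chi-squared fit that includes a constant fit function. A minimal sketch, assuming NumPy; the synthetic data and the quadratic model are hypothetical choices made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = np.linspace(0.0, 3.0, n)
u = rng.uniform(0.5, 2.0, n)                 # individual uncertainties
y = 1.0 + 2.0 * x + rng.normal(0.0, u)       # synthetic measurements
w = u**-2.0

# Least-chi-squared fit to constant, x, and x^2.
F = np.column_stack([np.ones(n), x, x**2])
b = np.linalg.solve(F.T @ (w[:, None] * F), F.T @ (w * y))
y_mod = F @ b
eps = y - y_mod                              # residuals
ybar = (w * y).sum() / w.sum()               # weighted average

raw_lhs = (w * y**2).sum()                                         # SST (raw)
raw_rhs = (w * y_mod**2).sum() + (w * eps**2).sum()                # SSA + SSE
anova_lhs = (w * (y - ybar)**2).sum()                              # SST (ANOVA)
anova_rhs = (w * (y_mod - ybar)**2).sum() + (w * eps**2).sum()     # SSA + SSE
print(raw_lhs, raw_rhs)
print(anova_lhs, anova_rhs)
print((w * eps).sum())                       # weighted sum of residuals, ~0
```

Dropping the constant column from F leaves the raw identity intact but generally breaks the ANOVA identity, because the weighted sum of residuals is then no longer forced to zero.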


Distribution of Weighted Orthogonal Fit Coefficients in Pure Noise 


As in the unweighted case, in hopes of hypothesis testing, we need the distribution of the b_k in pure noise 
(no signal). Here again, if a fit function is orthogonal (w.r.t. the weights) to all other fit functions, then its 
(least-chi-squared) fit coefficient is given by a simple correlation. I.e., for a given k: 

\[ \mathbf{f}_k \cdot \mathbf{f}_j = 0 \ \text{ for all } j \ne k \quad\Rightarrow\quad
   b_k = \frac{\mathbf{f}_k \cdot \mathbf{y}}{\mathbf{f}_k \cdot \mathbf{f}_k}
       = \frac{\sum_{i=1}^{n} w_i f_k(t_i)\, y_i}{\sum_{i=1}^{n} w_i f_k(t_i)^2} . \]


For convenience, we now further restrict ourselves to a normalized (over the {t_i}) fit-function, though this 
imposes no real restriction, since any function is easily normalized by a scale factor. Then: 

\[ \sum_{i=1}^{n} w_i f_k(t_i)^2 = 1 \quad\Rightarrow\quad b_k = \sum_{i=1}^{n} w_i f_k(t_i)\, y_i . \qquad (9.9) \]

Now consider an ensemble of samples (sets) of noise, each with the same set of {(t_i, u_i)}, and each 
producing a random b_k. In other words, the b_k are RVs over the set of possible samples. We now find var(b_k) 
and <b_k>. Recall that the variance of a sum (of uncorrelated RVs) is the sum of the variances, and the variance 
of k times an RV = k² var(RV). All the values of w_i and f_k(t_i) are constants, and var(y_i) = u_i² = w_i⁻¹; therefore 
taking the variance of (9.9): 

\[ \mathrm{var}(b_k) = \sum_{i=1}^{n} w_i^2 f_k(t_i)^2\, \mathrm{var}(y_i)
   = \sum_{i=1}^{n} w_i^2 f_k(t_i)^2\, \frac{1}{w_i} = \sum_{i=1}^{n} w_i f_k(t_i)^2 = 1 . \qquad (9.10) \]


This is different than the unweighted case, because the noise variance σ² has been incorporated into the 
weights, and therefore into the normalization of the f_k. 


We now find the average <b_k>. Taking the ensemble average of (9.9), with every y_i having the same mean <y>: 

\[ \langle b_k \rangle = \left\langle \sum_{i=1}^{n} w_i f_k(t_i)\, y_i \right\rangle = \langle y \rangle \sum_{i=1}^{n} w_i f_k(t_i) . \]


Since the sum has no simple interpretation, this equation is most useful for showing that if the noise 
(measured in y_i) is zero-mean, then b_k is also zero-mean: <b_k> = 0. However, if the summation happens to 
be zero (i.e., the predictor is zero-mean), then even for non-zero mean noise, we again have <b_k> = 0. 


Furthermore, any weighted sum of gaussian RVs is a gaussian; therefore, if the y_i are gaussian (zero- 
mean or not), then b_k is also gaussian. 
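A quick Monte Carlo experiment illustrates that a normalized fit coefficient has unit variance in pure noise, whatever the individual uncertainties. A minimal sketch, assuming NumPy, zero-mean gaussian noise with standard deviations u_i, and a single (hence trivially orthogonal) fit function; all numerical values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 50, 20000
t = np.linspace(0.0, 1.0, n)
u = rng.uniform(0.5, 2.0, n)             # individual uncertainties
w = u**-2.0                              # raw weights

f = np.sin(2 * np.pi * t)
f = f / np.sqrt((w * f**2).sum())        # normalize so that sum_i w_i f_i^2 = 1

# Each row is one sample set of pure noise, with var(y_i) = u_i^2.
noise = rng.normal(0.0, u, size=(trials, n))
b = noise @ (w * f)                      # b_k = sum_i w_i f_k(t_i) y_i for each set
print(b.mean(), b.var())                 # expect about 0 and about 1
```

The estimates fluctuate from run to run; with 20000 trials the sample variance should typically land within a few percent of 1.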


Non-Correlation of Weighted Fit Coefficients in Uncorrelated Noise 


We now consider the correlation between two fit coefficients, b_k and b_m (again, over multiple samples 
(sets) of noise), when the fit-functions f_k and f_m are orthogonal to each other, and to all other fit-functions. 
(From the homoskedastic equivalent measurements, we already know that b_k and b_m are uncorrelated. 
However, for completeness, we now show this fact directly from the weighted data.) For convenience, we 
take f_k and f_m to be normalized: f_k² = f_m² = 1 (recall that our dot-products are weighted). 


As in the unweighted case, we derive the covariance of b_k and b_m from the bilinearity of the cov( ) 
operator. We start with the formula for a fit-coefficient of a normalized fit-function that is orthogonal to all 
others, (9.9), and use our algebra of statistics: 

\[ \mathrm{cov}(b_k, b_m) = \mathrm{cov}\left( \mathbf{f}_k \cdot \mathbf{y},\ \mathbf{f}_m \cdot \mathbf{y} \right)
   = \mathrm{cov}\left( \sum_{i=1}^{n} w_i f_{ki}\, y_i ,\ \sum_{j=1}^{n} w_j f_{mj}\, y_j \right) . \]

Again, all the w_i, w_j, f_ki, and f_mj are constants, so they can be pulled out of the cov( ) operator: 

\[ \mathrm{cov}(b_k, b_m) = \sum_{i=1}^{n} \sum_{j=1}^{n} w_i f_{ki}\, w_j f_{mj}\, \mathrm{cov}(y_i, y_j) . \]


It is quite common that the noise between any two samples is uncorrelated (and usually independent). Then 
the y_i are uncorrelated. Hence, when i ≠ j, cov(y_i, y_j) = 0, so only the i = j terms survive, and the double sum 
collapses to a single sum: 

\[ \mathrm{cov}(b_k, b_m) = \sum_{i=1}^{n} w_i^2 f_{ki} f_{mi}\, \mathrm{cov}(y_i, y_i) . \]


Now cov(y_i, y_i) = var(y_i) = u_i² = w_i⁻¹, so: 

\[ \mathrm{cov}(b_k, b_m) = \sum_{i=1}^{n} w_i f_{ki} f_{mi} = \mathbf{f}_k \cdot \mathbf{f}_m = 0 . \qquad (9.11) \]


This is true for arbitrary distributions of y_i, even if the y_i are nonzero-mean. 


The Weighted Total Sum-of-Squares (SST) in Pure Noise 
The weighted total sum of squares is: 

\[ \text{raw:} \quad SST \equiv \mathbf{y} \cdot \mathbf{y} = \sum_{i=1}^{n} w_i y_i^2 , \]
\[ \text{ANOVA:} \quad SST \equiv \left( \mathbf{y} - \bar{y} \right)^2 = \sum_{i=1}^{n} w_i \left( y_i - \bar{y} \right)^2 , \quad\text{where}\quad \bar{y} \equiv \frac{1}{V_1} \sum_{i=1}^{n} w_i y_i . \]


For gaussian noise, if the uncertainties were perfect, then each term in the raw sum above would be ∈ χ²_1. But 
real uncertainties are just estimates, so in contrast to the unweighted case, the weighted SST (taken over an 
ensemble of samples) is not exactly a χ² RV. It has no general PDF. However, if the uncertainties are fairly 
reliable, we can often approximate SST’s distribution as χ² with n dof (raw), or n − 1 dof (ANOVA), especially 
when n is large. 


The Weighted Model Sum-of-Squares (SSA) in Pure Noise 
Recall that the model can be thought of as a vector, y_mod ≡ {y_mod,i}, and the basis functions for that vector 
are the fit-functions evaluated at the sample points, f_m ≡ {f_mi}. Then: 

\[ \mathbf{y}_{mod} = \sum_{m=1}^{p} b_m \mathbf{f}_m . \]


The f_m may be oblique (non-orthogonal), and of arbitrary normalization. However, as in the unweighted 
case, there exists an orthonormal basis in which y_mod may be written (just like eq. (8.7)): 

\[ \mathbf{y}_{mod} = \sum_{m=1}^{p} c_m \mathbf{g}_m \quad\text{where}\quad \mathbf{g}_m \equiv \text{orthonormal basis} , \quad c_m \equiv \text{coefficients in the } \mathbf{g} \text{ basis} . \]

We’ve shown that the c_m are uncorrelated (see (9.11)), with var(c_m) = 1 (using raw weights, see (9.10)). Then 
(recall that the dot-products are weighted): 

\[ SSA \equiv \mathbf{y}_{mod}^2 = \left( \sum_{m=1}^{p} c_m \mathbf{g}_m \right)^2 . \]

Since the g_m are orthogonal, all the cross terms in the square are zero. The g_m are normalized, so: 

\[ \mathbf{y}_{mod}^2 = \sum_{m=1}^{p} \left( c_m \mathbf{g}_m \right)^2 = \sum_{m=1}^{p} c_m^2\, \mathbf{g}_m^2 = \sum_{m=1}^{p} c_m^2 . \qquad (9.12) \]


Therefore, y_mod² is the sum of p uncorrelated RVs (the c_m²). We find <SSA> = <y_mod²> using the general formula 
for the average of the square of an RV (7.2): 

\[ \langle c_m^2 \rangle = \langle c_m \rangle^2 + \mathrm{var}(c_m) = \langle c_m \rangle^2 + 1 \quad\Rightarrow\quad
   \langle SSA \rangle = \langle \mathbf{y}_{mod}^2 \rangle = p + \sum_{m=1}^{p} \langle c_m \rangle^2 , \]


where var(c_m) comes from (9.10). This is true for any distribution of uncorrelated noise, even non-zero- 
mean. In general, there is no simple formula for var(y_mod²). 


If the noise is zero-mean, then each <c_m> = 0, and the above reduces to: 

\[ \langle \mathbf{y}_{mod}^2 \rangle = p \qquad \text{(zero-mean noise)} . \]


If the noise is zero-mean and gaussian, then the c_m are zero-mean uncorrelated gaussian RVs. Because the c_m 
are linear combinations of the gaussian y_i, they are jointly gaussian, and for jointly gaussian RVs, being 
uncorrelated is a well-known condition for independence [ref ??]. So the c_m are independent, gaussian, with 
variance 1 (see (9.10)). Then (9.12) tells us that y_mod² is a chi-squared RV with p degrees of freedom: 

\[ \text{(raw)} \quad \mathbf{y}_{mod}^2 = SSA \in \chi^2_p \qquad \text{(zero-mean gaussian noise)} . \]


We developed this result using the properties of the orthonormal basis g_m, but our model y_mod, and therefore 
y_mod², are identical in any basis. Therefore, the result holds for any p fit-functions that span the same model 
space, even if they are oblique (i.e. overlapping) and not normalized. 
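The claim that <SSA> = p holds even for oblique, unnormalized fit functions can be checked by simulation. A minimal sketch, assuming NumPy and zero-mean gaussian noise whose standard deviations are exactly the u_i; the three polynomial fit functions are a hypothetical choice:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, trials = 40, 3, 5000
x = np.linspace(0.0, 1.0, n)
u = rng.uniform(0.5, 2.0, n)
w = u**-2.0
F = np.column_stack([np.ones(n), x, x**2])          # oblique, unnormalized fit functions

A = F.T @ (w[:, None] * F)                          # weighted normal-equation matrix
Y = rng.normal(0.0, u, size=(trials, n))            # pure-noise sample sets (rows)
B = np.linalg.solve(A, F.T @ (w[:, None] * Y.T))    # fit coefficients, shape (p, trials)
Y_mod = (F @ B).T                                   # model values, shape (trials, n)
ssa = (w * Y_mod**2).sum(axis=1)                    # raw weighted SSA for each sample set
print(ssa.mean())                                   # expect about p = 3
```

For a chi-squared RV with p degrees of freedom the variance is 2p, so the sample mean of 5000 trials should typically fall within a few hundredths of p.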


The Residual Sum-of-Squares (SSE) in Pure Noise 


For zero-mean gaussian noise, in the weighted case, we’ve shown that SSA is χ²_p distributed, but SST is 
not exactly χ². Therefore, SSE is not, either. However, for large n, SST and SSE are approximately χ² 
distributed, with the usual (i.e. equal uncertainty case) degrees of freedom assigned: 

\[ \text{raw:} \quad
   \underbrace{\sum_{i=1}^{n} w_i y_i^2}_{SST,\ dof = n}
   = \underbrace{\sum_{i=1}^{n} w_i y_{mod,i}^2}_{SSA,\ dof = p}
   + \underbrace{\sum_{i=1}^{n} w_i \varepsilon_i^2}_{SSE,\ dof = n-p}
   \qquad \text{(zero-mean gaussian)} \]
\[ \text{ANOVA:} \quad
   \underbrace{\sum_{i=1}^{n} w_i \left( y_i - \bar{y} \right)^2}_{SST,\ dof = n-1}
   = \underbrace{\sum_{i=1}^{n} w_i \left( y_{mod,i} - \bar{y} \right)^2}_{SSA,\ dof = p-1}
   + \underbrace{\sum_{i=1}^{n} w_i \varepsilon_i^2}_{SSE,\ dof = n-p}
   \qquad \text{(zero-mean gaussian)} \]


Hypothesis Testing a Model in Linear Regression with Uncertainties 


The approximation that SST and SSE are almost χ² distributed allows the usual F-test as an approximate 
test for detection of a signal, i.e. testing whether the fit actually matches the presence of the model in the 
data. However, the F critical values will be approximate, and therefore so will the p-value. In fact, the 
approximation is usually worst in the tails of the distribution, right around where your p-value is, so precise 
p-values cannot be determined from an F-test. 


In many cases, numerical simulations (shuffle simulations) can provide more reliable critical values than 
the theoretical gaussian F critical values, for 2 reasons: even the gaussian-noise F-values are only 
approximate (as described), and the noise itself is often significantly non-gaussian. 




10 Practical Considerations for Data Analysis | 


Rules of Thumb 


We present here some facts and theory, and also some suggestions, for processing data. Some of these 
suggestions might be called “rules of thumb.” They will be appropriate for many systems, but not all systems. 
Rules of thumb can often help avoid pitfalls, but in the end, there is no substitute for understanding the details 
of your own system. 


This chapter is more subjective than most others. Generally, there are no hard-and-fast rules for 
“optimum” data processing. The better you understand your system, the better choices you will be able to 
make. 


Note that: 


Signal to Noise Ratio (SNR) 


Some systems lend themselves to simple parameters describing how “clean” the measurements (or 
signal) are, or equivalently, how “noisy.” For example, communication systems, or a set of measurements, 
with additive noise can often be represented (to varying degrees of accuracy) by a Signal-to-Noise ratio, or 
SNR. In contrast, some other systems cannot be reasonably reduced to such a single parameter. We consider 
here systems with additive noise. We define noise as random, though it may have a wide variety of statistical 
properties, e.g. gaussian, uniform, zero-mean, or biased. 


In addition to noise, which is random, measurements are often distorted by deterministic effects, such as 
nonlinearities in the system. If you know the distortion operation, you can sometimes correct for it. Any 
residual (uncorrected) distortion usually ends up being treated as if it were noise. (I once consulted for a 
communication company that was working to correct for nonlinear distortion that had previously been 
essentially ignored, and so treated as if it were noise. By correcting for the deterministic part, we were able 
to get a higher signal-to-noise ratio, and therefore better performance, than other systems.) 


The term “signal to noise ratio” is widely used, and often abused. In data analysis, “SNR” has many 
definitions, so SNR quotes, by themselves, cannot be interpreted. At best, SNR is always an estimate; one 
can never perfectly separate signal and noise. If you could, you would recover the signal perfectly, and at 
that point, you have eliminated all noise, and your SNR is infinite. 


By far the most widely used definition of SNR, and we think the most appropriate, is signal “energy” 
divided by noise “energy”. In this context, “energy” simply means “sum of squares” (SSQ): 


\[ SNR \equiv \frac{SSQ(\text{signal})}{SSQ(\text{noise})} . \]


For zero-mean signals or noise, the sum of squares is proportional to the variance, so sometimes you'll see 
SNR written as the ratio of two variances. 


SNR lies between 0 and ∞: 0 means no signal (all noise), and ∞ means all signal (no noise). For many 
systems, their performance can be well determined from SNR alone. This computation can be harder than it 
looks, though, because there is not always a clear definition of “signal” and “noise.” However, in many 
common cases, there are generally accepted definitions that scientists and engineers should adhere to. We 
describe some of those cases below. 


SNR is fundamentally a dimensionless quantity, but is often quoted in decibels (dB), a logarithmic scale 
for dimensionless quantities: 

\[ SNR_{dB} \equiv 10 \log_{10} SNR = 10 \log_{10} \frac{SSQ(\text{signal})}{SSQ(\text{noise})} = 20 \log_{10} \frac{RMS(\text{signal})}{RMS(\text{noise})} . \]

An increment of 3 dB corresponds to a factor of 2 change in SNR (more precisely, 10^0.3 ≈ 1.995). 

Any computation of SNR is necessarily an estimate, 
because SNR is itself somewhat corrupted by noise. 

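As a small illustration of the decibel conversion, the following sketch (assuming NumPy; the function name `snr_db` is hypothetical) computes SNR in dB from a signal sequence and a noise sequence of the same length:

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in decibels, from the sums of squares of the two sequences."""
    signal = np.asarray(signal, dtype=float)
    noise = np.asarray(noise, dtype=float)
    snr = (signal**2).sum() / (noise**2).sum()    # dimensionless ratio of "energies"
    return 10.0 * np.log10(snr)

sig = np.array([2.0, 2.0, 2.0, 2.0])
noi = np.array([1.0, -1.0, 1.0, -1.0])
print(snr_db(sig, noi))                           # SNR = 4, i.e. about 6.02 dB
```

Here the RMS values are 2 and 1, so the RMS form, 20 log10(2), gives the same number of decibels.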
Computing SNR From Data 


To directly apply the above definition, we need two sets of data: one we call “signal” or “model,” and 
another we call “noise” or “residuals.” Then the computations of SSQ are straightforward. In all cases, we 
start with a set of n measurements, y_i, i = 1 ... n. If we can somehow separate the data into two sequences, 
model and noise, then we compute the SNR above as: 

\[ \text{Define:} \quad y_i \equiv \underbrace{y_{mod,i}}_{\text{model}} + \underbrace{\varepsilon_i}_{\text{noise}} . \quad\text{Then:}\quad
   SNR \equiv \frac{SSQ(y_{mod,i})}{SSQ(\varepsilon_i)} , \]
\[ \text{where}\quad SSQ(y_{mod,i}) \equiv \sum_{i=1}^{n} y_{mod,i}^2 , \qquad SSQ(\varepsilon_i) \equiv \sum_{i=1}^{n} \varepsilon_i^2 . \]


One simple way to estimate a “signal” is to filter the data, either with analog filters, or the digital 
equivalent thereof. Digitally, one can also use more specialized filters, such as a “median filter.” In all cases, 
one takes the filtered output as the “signal.” Another way to estimate a signal is to fit a model to the data. 


SNR for Linear Fits 


When we fit a model to data with a linear least-squares fit (i.e., minimum χ²), the total SSQ in the 
measurements partitions cleanly into “model” and “noise.” This is a form of the sum-of-squares identity (see 
elsewhere for details of linear fitting): 

\[ SSQ(y_i) = SSQ(y_{mod,i}) + SSQ(\varepsilon_i) . \]

Then SNR is well-defined, and simple. However, note that any “misfit” is counted as noise. 


An important factor in estimating SNR is over what range you take the fit, and necessarily then, measure 
the χ². If you include regions you don’t care about, it will make the χ² less relevant to you. 


For linear least-squares fits, we can use the SSQ identity to write SNR in other ways: 

\[ SNR \equiv \frac{SSQ(\text{signal})}{SSQ(\text{noise})} = \frac{SSQ(\text{signal})}{SSQ(\text{data}) - SSQ(\text{signal})} . \]
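For a linear least-squares fit, the two expressions for SNR agree to rounding error. A minimal sketch, assuming NumPy, unweighted data, and a hypothetical sine-plus-cosine model fitted to synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = np.linspace(0.0, 1.0, n)
y = 3.0 * np.sin(2 * np.pi * x) + rng.normal(0.0, 1.0, n)   # synthetic measurements

# Linear least-squares fit to a sine and a cosine.
F = np.column_stack([np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)])
b, *_ = np.linalg.lstsq(F, y, rcond=None)
y_mod = F @ b            # "signal"
eps = y - y_mod          # "noise" (includes any misfit)

def ssq(v):
    """Sum of squares of a sequence."""
    return float((v**2).sum())

snr_direct = ssq(y_mod) / ssq(eps)
snr_identity = ssq(y_mod) / (ssq(y) - ssq(y_mod))
print(snr_direct, snr_identity)
```

The second form needs only the data and the model values, which is convenient when the residuals are not kept; it is valid only when the SSQ identity holds, i.e. for linear least-squares fits.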


SNR for Nonlinear Fits 


As shown elsewhere, for a nonlinear fit, the model values are not orthogonal to the residuals, and the 
SSQ identity does not hold: 


\[ SSQ(y_i) \ne SSQ(y_{mod,i}) + SSQ(\varepsilon_i) \qquad \text{(nonlinear fit)} . \]


However, the above definition for SNR is still useful as an approximation (remember: all estimates of SNR 
are corrupted by noise). A “reasonable” fitting procedure will produce both a “signal” and “noise” that each 
have less SSQ than the original data. Then the SNR still lies between 0 and ∞. 


Other Definitions of SNR 


We discourage any other uses of the term SNR, but rarely, it is computed as the ratio of RMS values (or 
standard deviations), rather than SSQ (or variances). Still other more specialized definitions also exist. 
However: 


SNR should always be dimensionless, and lie in the interval [0, ∞]. 


Spectral Method of Estimating SNR 


In some cases, there is a way to estimate the SNR without explicitly separating a “signal” from the data. 
As before, suppose you have n points of measured data, which consist of an unknown signal plus noise: 

\[ \text{Define:} \quad y_i \equiv \underbrace{s_i}_{\text{signal}} + \underbrace{\varepsilon_i}_{\text{noise}} , \qquad i = 1, \ldots n . \]


If you know the approximate Fourier amplitude (or equivalently, power) spectrum of the signal, and the noise 
amplitude spectrum, you can estimate the SNR. This Fourier amplitude often exists in an abstract Fourier 
space, with little or no physical meaning. Note that you don’t need the phase part of the Fourier spectrum, 
and (infinitely) many different signals have the same amplitude spectrum. 


Figure 10.1 Estimating SNR from an amplitude spectrum. (a) Ideal and measured signal. 
(b) Fourier transform, with white noise. The noise spectrum is assumed known from other sources. 


Figure 10.1 shows an example of a measured signal, and its discrete Fourier transform (DFT). The signal 
is s(x), but it is measured at discrete intervals x_i, so: 

\[ s_i \equiv s(x_i) , \quad\text{and}\quad y_i = s_i + \varepsilon_i . \]


Any measurement includes noise. Suppose we know our noise spectrum (say, from other 
measurements). E.g., the noise spectrum is often white (constant). (In fact, in the common case where all 
the noise contributions, ε_i, are uncorrelated, then the noise is white.) From the measurement spectrum, we 
find the band (in this abstract Fourier space) where the signal “energy” resides. This band is wherever the 
measurements are significantly above the noise floor (Figure 10.1b). The energy in this band is signal + 
noise. 


Knowing the “band of interest” of our signal, we can, in principle, filter our measurements to keep only 
(abstract) frequencies in that band. That will clean up our measurements somewhat, improving their signal- 
to-noise ratio. Then in terms of signal-energy and noise-energy: 


_ signal _ signal + noise 


SNR iE 


noise noise 


The signal + noise energy is the sum of the squares of the discrete Fourier component amplitudes: 


$$\text{signal} + \text{noise} = \sum_{k \,\in\, \text{signal band}} |S_k|^2, \quad \text{where } S_k = \text{Fourier coefficients of the transformed data} .$$


The noise energy is found from our outside source of noise power spectral density. For white noise, the 
spectrum is constant, so: 


$$\text{noise} = (\text{signal bandwidth})\,(\text{noise power spectral density}) = (f_{high} - f_{low})\,N_0 \quad \text{("energy")} .$$


For frequency-dependent noise spectra, $N_0(f)$ in “energy” per Hz, we must integrate to find the noise energy:

$$\text{noise} = \int_{f_{low}}^{f_{high}} N_0(f)\,df \quad \text{("energy")} .$$
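The SNR estimate described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `estimate_snr`, the representation of the signal band as a list of DFT indices, and a white-noise level expressed as "energy" per DFT bin (rather than per Hz) are choices made here, not taken from the text.

```python
def estimate_snr(spectrum, band, noise_per_bin):
    """Estimate SNR = ((signal + noise) - noise) / noise within a signal band.

    spectrum      : DFT amplitudes S_k (real or complex)
    band          : indices k of the signal band (assumed already identified)
    noise_per_bin : white-noise "energy" per DFT bin (assumed known from elsewhere)
    """
    total = sum(abs(spectrum[k]) ** 2 for k in band)  # signal + noise energy
    noise = len(band) * noise_per_bin                 # white noise: constant per bin
    return (total - noise) / noise

# Hypothetical amplitude spectrum: a noise floor of 1 with a signal band at k = 2, 3.
spectrum = [1.0, 1.0, 5.0, 5.0, 1.0, 1.0]
print(estimate_snr(spectrum, band=[2, 3], noise_per_bin=1.0))  # 24.0
```

For a frequency-dependent noise spectrum, the product `len(band) * noise_per_bin` would be replaced by a sum of the per-bin noise energies over the band.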


Tip: be careful how you think about the abstract Fourier space. In one real-world example, a physicist
measured the transfer function H(f) of an analog filter, and wanted to estimate the SNR of that measurement
of H(f). A transfer function is a function of frequency; however, we must think of it as just a function of an
independent variable (say, a Lorentzian function of f). Now, we take the Fourier transform of H(f), call it
n(g). This transform exists in an abstract Fourier space: it is a function of abstract frequency g. We must
distinguish the abstract frequency g of the transform of H(f) from the real frequency f of the transfer function
H(f) itself. In this example, the noise floor in g-space was a known property of the measurement device (a
network analyzer), so he could estimate his signal-band noise-energy from the noise floor.


Fitting Models To Histograms (Binned Data) 


Data analysis often requires fitting a function to binned data, for example, fitting a probability 
distribution (PDF) to a histogram of measured values. While such fitting is very commonly done, it is much 
less commonly understood. There are important subtleties often overlooked. This section assumes you are 
familiar with the binomial distribution, the $\chi^2$ “goodness of fit” parameter (described earlier), and some basic
statistics. 


The general method for fitting a model to a histogram of data is this: 

• Start with n data points (measurements), and a parameterized model for the PDF of those data.
• Bin the data into a histogram.
• Find the model parameters which “best fit” the data histogram.


For example, suppose we have n measurements that we believe should follow a gaussian distribution. A 
gaussian distribution is a 2-parameter model: the average, $\mu$, and standard deviation, $\sigma$. To find the $\mu$ and $\sigma$
that “best fit” our data, we might bin the data into a histogram, and then fit the gaussian PDF to it (Figure
10.2). (Of course, for a gaussian distribution, there are better ways to estimate $\mu$ and $\sigma$, but the example
illustrates the point of fitting to a histogram. Often, a realistic model is more complicated, and there are no 
simple formulas to compute the model parameters. We use the gaussian as an example because it is familiar 
to many.) 


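[Figure: histogram of measured bin counts $c_i$ vs. measurement x, overlaid with the model PDF. The fit error in each bin is the difference between the measured bin count, $c_i$, and the predicted bin count, $\mathrm{model}_i$.]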


Figure 10.2 Sample histogram with a 2-parameter model PDF ($\mu$ and $\sigma$). The fit model is gaussian
in this example, but could be any PDF with any parameters. 


We must define “best fit.” Usually, we use the $\chi^2$ (chi-squared) “goodness of fit” parameter as the figure
of merit (FOM). The smaller $\chi^2$, the better the fit. Fitting to a histogram is a special case of general $\chi^2$ fitting.
Therefore, we need to know two things for each bin: (1) the predicted (model) count, and (2) the uncertainty
to use in the $\chi^2$. We now find these two quantities.


Chi-squared For Histograms: Theory 


We develop here the $\chi^2$ figure of merit for fitting to a histogram. A sample is a set of n measurements
(data points). In principle, we could take many samples of data. For each sample, there is one histogram,
i.e., there is an infinite population of samples, each with its own histogram. But we have only one sample:
the one we measured. The question is, how well does our one histogram agree with a given population (PDF)
of data measurements?


Before computing the $\chi^2$ figure of merit for the fit, we must first understand the statistics of a single histogram bin,
from the population of all histograms that we might have produced from different samples. The key point is
this: given a sample of n data points, and a particular histogram bin numbered i, each data point in the sample
is either in the bin (with probability $p_i$), or it’s not (with probability $1 - p_i$). Therefore, the count in the i-th
histogram bin is binomially distributed, with some probability $p_i$, and n “trials.” (See standard references on
the binomial distribution if this is not clear.) Furthermore, this is true of every histogram bin.


[Aside: when counting events in a fixed time interval, one gets a Poisson distribution of counts. That is not 
our case, here. ] 


Recall that a binomial distribution is a discrete distribution, i.e. it gives the probability of finding values 
of a natural-number random variable (a count of something); in this case, it gives the probability for finding 
a given number of counts in a given histogram bin. The binomial distribution has two parameters: 


p = the probability of a given data point being in the bin 
n = the number of data points in the sample, and therefore the number of “trials” in the binomial 
distribution. 
The binomial distribution has average, a, and variance, $\sigma^2$, given by:
$$a = np, \qquad \sigma^2 = np(1 - p) \qquad \text{(binomial distribution)} . \qquad (10.1)$$
A general $\chi^2$ indicator is given by:

$$\chi^2 = \sum_{i=1}^{N_{bins}} \frac{(c_i - \mathrm{model}_i)^2}{\sigma_i^2},$$

where $c_i$ = the measured count in the i-th bin,
$\mathrm{model}_i$ = the model average count in the i-th bin,
$\sigma_i^2$ = the model variance of the i-th bin.


Chi-squared For Histograms: Practice 
Computing the $\chi^2$ figure of merit for the fit typically comprises these steps:
• Given the trial fit parameters, compute a (usually unnormalized) PDF for the parameters.

• Normalize the model PDF to the data: compute the scale-factor required to match the number
of measured data points.

• Compute the model variance of each bin (using the above two results).
• Compute the final $\chi^2$.
We now consider each of these steps.


Compute the unnormalized model PDF: We find the model average bin-count for a bin from the
model PDF: typically, the bins are narrow, and the absolute probability of a single measurement being in a
bin is just:

$$\Pr(\text{being in bin } i) = p_i = S'\,\mathrm{pdf}_X(x_i)\,\Delta x_i \qquad \text{(narrow bins)},$$

where $x_i$ = bin center,
$\Delta x_i$ = bin width,
$S'$ = possibly unknown normalization factor,
$\mathrm{pdf}_X(x_i)$ = unnormalized model pdf at bin center.

If the approximation $\mathrm{pdf}_X(x_i)\,\Delta x_i$ is too crude, one can use any more sophisticated method to better integrate
the PDF over the bin width to find $p_i$.


Then the model average bin-count for n measurements is Pr(being in bin) times n, so we define a scale 
factor S to include both the (possibly unknown) normalization factor for the PDF, as well as the scaling for 
the number of measurements n: 


$$\mathrm{model}_i = S\,\mathrm{pdf}_X(x_i)\,\Delta x_i \qquad \text{(narrow bins)},$$

where S = as-yet unknown scale factor,
$x_i$ = bin center, $\Delta x_i$ = bin width,
$\mathrm{pdf}_X(x_i)$ = unnormalized model pdf at bin center.

Note that $\Delta x_i$ need not be constant across the histogram; in fact, it is often useful to have $\Delta x_i$ wider for
small values of the PDF, so the bin-count is bigger (more on this later).


Normalize the model PDF: The PDF computed above is often unnormalized, for two reasons: First, 
some models are hard to analytically normalize, but easy to calculate as relative probability densities, and
therefore we use an unnormalized PDF. Second, most histograms of measured data cover only a subset of 
all possible measured values, and so even a normalized PDF is not normalized over the restricted range of 
the histogram. 


Normalizing the model PDF is trivial: scale the unnormalized PDF such that the sum of the model bin-counts
equals the actual number of data points in the histogram:

$$n = \sum_{i=1}^{N_{bins}} \mathrm{model}_i = S \sum_{i=1}^{N_{bins}} \mathrm{pdf}_X(x_i)\,\Delta x_i \quad \Rightarrow \quad S = \frac{n}{\sum_{i=1}^{N_{bins}} \mathrm{pdf}_X(x_i)\,\Delta x_i} .$$

(Note that for this step, we don’t need to know the actual normalization factor for the model PDF.) The
model value for each bin is then:

$$\mathrm{model}_i = S\,\mathrm{pdf}_X(x_i)\,\Delta x_i .$$


A common mistake is to include the scale parameter S as a fit parameter, instead of computing it exactly, 
as just described. Fitting for S makes your fits less stable, less accurate, and slower. (S would be a “nuisance 
parameter.”) In general: 


The fewer the number of fit parameters, the more stable, accurate, and faster your fits will be. 


Compute the model variance for each bin: When computing $\chi^2$, we are considering how likely are
the data to appear for the given trial model. For some applications of $\chi^2$, one uses the measurement
uncertainties in the denominators of the $\chi^2$ terms. However, when fitting PDFs to histograms, the model
itself tells you the uncertainty (variance) in the bin count. Therefore, the “uncertainty” in the denominator
of the $\chi^2$ terms is that of the model. The exact variance of bin i comes from the binomial distribution (10.1):

$$\sigma_i^2 = n p_i (1 - p_i), \quad \text{where } p_i = \mathrm{model}_i / n .$$

For a large number of histogram bins, $N_{bins}$, the probability of being in a given bin is of order $p_i \sim 1/N_{bins}$,
which is small. Therefore (though we don’t agree with it), people often approximate the model variance of
the bin-count as:

$$\sigma_i^2 = n p_i (1 - p_i) \approx n p_i = \mathrm{model}_i \qquad (N_{bins} \gg 1 \;\Rightarrow\; p_i \ll 1) .$$

However, conceptually, and for quick estimates, it is important to know that $\sigma_i^2 \approx \mathrm{model}_i$.
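To get a feel for how much the approximation costs, the short Python sketch below compares the exact binomial variance with the approximate value for a few model bin-counts. The function name `bin_variance` and the sample numbers (n = 1000 measurements) are hypothetical, chosen only for illustration.

```python
def bin_variance(model_count, n):
    """Exact binomial variance n p (1 - p) of a bin, with p = model_count / n."""
    p = model_count / n
    return n * p * (1.0 - p)

n = 1000  # hypothetical number of measurements
for model_count in (5.0, 50.0, 500.0):
    exact = bin_variance(model_count, n)
    # The ratio exact / approximate equals (1 - p); it is near 1 only when p is small.
    print(model_count, exact, exact / model_count)
```

The approximation is good for the bin holding 0.5% of the data, already 5% off for a bin holding 5% of the data, and off by a factor of 2 for a bin holding half of it.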


Compute the final $\chi^2$: We now have, for each histogram bin, (1) the model average count, $\mathrm{model}_i$, and
(2) the model variance of the measured count, $\sigma_i^2$. We compute $\chi^2$ for the model PDF (given a set of model
parameters) in the usual way:

$$\chi^2 = \sum_{i=1}^{N_{bins}} \frac{(c_i - \mathrm{model}_i)^2}{\sigma_i^2},$$

where $c_i$ = the measured count in the i-th bin,
$\mathrm{model}_i, \sigma_i^2$ = the model average and variance in the i-th bin.

If your model predicts any variance of zero (happens when $\mathrm{model}_i = 0$ for some i), then $\chi^2$ blows up. This is
addressed below.
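The four steps above can be put together in a short Python sketch. It is an illustration under stated assumptions, not a definitive implementation: the names `gaussian_shape` and `histogram_chi2`, the argument layout, and the sample histogram are all hypothetical, and the narrow-bin approximation is used for the bin probabilities.

```python
import math

def gaussian_shape(x, mu, sigma):
    """Unnormalized gaussian; its overall scale is irrelevant because S is computed below."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def histogram_chi2(counts, centers, widths, pdf, params):
    """Chi-squared of measured bin `counts` against an unnormalized model `pdf`."""
    n = sum(counts)
    # Step 1: unnormalized model value per bin (narrow-bin approximation).
    raw = [pdf(x, *params) * dx for x, dx in zip(centers, widths)]
    # Step 2: scale factor S, so that the model bin-counts sum to n.
    scale = n / sum(raw)
    model = [scale * r for r in raw]
    # Step 3: exact binomial variance n p (1 - p), with p = model_i / n.
    variance = [m * (1.0 - m / n) for m in model]
    # Step 4: sum the squared residuals, weighted by the model variance.
    return sum((c - m) ** 2 / v for c, m, v in zip(counts, model, variance))

# Hypothetical histogram: 7 unit-width bins centered on -3 ... 3.
centers = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
widths = [1.0] * 7
counts = [2, 9, 22, 31, 24, 10, 2]
for mu in (0.0, 0.5):
    print(mu, histogram_chi2(counts, centers, widths, gaussian_shape, (mu, 1.3)))
```

Note that the scale factor is computed inside the function rather than passed in as a fit parameter, so an optimizer wrapped around `histogram_chi2` would vary only `params`.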


Reducing the Effect of Noise 


To find the best-fit parameters, we take our given sample histogram, and try different values of the pdf(x)
parameters (in our gaussian example above, $\mu$ and $\sigma$) to find the combination which produces the minimum
$\chi^2$.

Notice that the low count bins carry more weight than the higher count bins: $\chi^2$ weights the terms by
$1/\sigma_i^2 \approx 1/\mathrm{model}_i$. This reveals a common misunderstanding about fits to histograms:


A fit to a histogram is driven by the tails, not by the central peak. This is usually bad. 


Tails are often the worst part of the model (theory), and often the most contaminated (percentage-wise) by 
noise: background levels, crosstalk, etc. Three methods help reduce these problems: 


• Limiting the weight of low-count bins
• Truncating the histogram
• Rebinning


Limiting the weight: The model bin-counts in the tails of the distribution are often less than 1, and often approach zero.
This gives them extremely high weights compared to other bins. Since the model is probably inadequate at
these low bin counts (due to noise, etc.), one can limit the denominator in the $\chi^2$ sum to at least (say) 1; this
also avoids division-by-zero:

$$\chi^2 = \sum_{i=1}^{N_{bins}} \frac{(c_i - \mathrm{model}_i)^2}{d_i},$$

where $c_i$ = the measured count in the i-th bin, and
$d_i = \sigma_i^2$ if $\sigma_i^2 > 1$, and $d_i = 1$ otherwise.


This is an ad-hoc approach, and the minimum weight can be anything; it doesn’t have to be 1. Notice, though,
that each modified $\chi^2$ term is still a continuous function of $\mathrm{model}_i$. This means $\chi^2$ is a continuous function of
the fit parameters, which is critical for stable fits (it avoids local minima; see other considerations below).
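A minimal Python sketch of the limited-weight sum follows. The function name `clamped_chi2` and the `floor` parameter are hypothetical; the floor of 1 is the ad-hoc choice discussed above, and the sample numbers are invented for illustration.

```python
def clamped_chi2(counts, model, n, floor=1.0):
    """Chi-squared with each denominator limited to be at least `floor`."""
    total = 0.0
    for c, m in zip(counts, model):
        variance = m * (1.0 - m / n)        # binomial model variance of this bin
        total += (c - m) ** 2 / max(variance, floor)
    return total

# Hypothetical bins: one with a zero model count, one below 1, one well populated.
print(clamped_chi2(counts=[0, 1, 50], model=[0.0, 0.5, 40.0], n=1000))
```

Because `max(variance, floor)` is continuous in the model value, each term stays continuous in the fit parameters, and a model count of zero no longer causes a division by zero.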


Note that even if the best-fit model has reasonable values for all the bin-counts, during the optimization 
algorithm, the optimizer may explore unreasonable model parameters on its way to the best fit. Therefore: 


Truncating the histogram: In addition to limiting the bin weight, we can truncate the histogram on the
left and right sides to those bins with a reasonable number of measurements (not model counts), substantially
above the noise (Figure 10.3a). [Bev p110] recommends a minimum bin count of 10, based on a desire for
gaussian errors. I don’t think that matters much, since the $\chi^2$ parameter works reasonably well, even with
non-gaussian errors. In truth, the minimum count completely depends on the noise level: to be meaningful,
the bin-count must be dominated by the model over the noise.


[Figure: (a) histogram with model PDF, truncated to the bins from $x_s$ to $x_f$; (b) rebinned histogram with model PDF.]


Figure 10.3. Avoiding noisy tails by (a) truncating the histogram, or (b) rebinning. The left 3 bins 
are combined into a single bin, as are the right 3 bins. 


Truncation requires renormalizing (adjusting the number of measurements, n): we normalize the model 
within the truncated limits to the measured data count within those same limits. That is, we redefine n: 


$$n \equiv \sum_{i=s}^{f} c_i, \quad \text{where } s = \text{start bin \#}; \; f = \text{end bin \#}; \; c_i = \text{measured bin count} .$$


You might think that we should use the model, not the data histogram, to choose our truncation limits.
After all, why should we let sampling noise affect our choice of bins? However, using the model fails
miserably, because our bin choices change as the optimizer varies our parameters in the hunt for the optimum
$\chi^2$. Changing which bins are included in the FOM during the search causes unphysical jumps in $\chi^2$ as we
vary our parameters, making many local minima. These minima make the fit unstable, and generally
unusable. For stability: truncate your histogram based on the data, and keep it fixed during the parameter
search.
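As a sketch of data-based truncation, the Python fragment below picks the first and last bins whose measured counts reach a threshold, and recomputes n within those limits. The name `truncation_limits`, the threshold of 10, and the sample counts are assumptions made for illustration only.

```python
def truncation_limits(counts, min_count):
    """Indices (s, f) of the first and last bins whose measured count >= min_count."""
    keep = [i for i, c in enumerate(counts) if c >= min_count]
    return keep[0], keep[-1]   # raises IndexError if no bin qualifies

# Hypothetical measured histogram with sparse, noisy tails.
counts = [1, 0, 4, 12, 30, 41, 28, 11, 3, 0, 2]
s, f = truncation_limits(counts, min_count=10)
n_truncated = sum(counts[s:f + 1])   # redefined n: measured count within the limits
print(s, f, n_truncated)             # 3 7 122
```

The limits are computed once from the data, before any optimization, so the set of bins in the figure of merit stays fixed while the fit parameters vary.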


Rebinning: Instead of truncating, you can re-bin your data. Bins don’t have to be of uniform width 
[Bev p175], so combining adjacent bins into a single, wider bin with higher count can help improve signal- 
to-noise ratio (SNR) in that bin (Figure 10.3b). Note that when rebinning, we evaluate the theoretical count 
as the sum of the original (narrow) bin theoretical counts. In the example of Figure 10.3b, the theoretical and 
measured counts for the new (wider) bin 1 are: 


$$\mathrm{model}_1 = 1.2 + 3.9 + 10.8 = 15.9 \quad \text{and} \quad c_1 = 3 + 3 + 8 = 14 .$$
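Rebinning amounts to summing both the model counts and the measured counts over the same groups of original bins. A minimal Python sketch, using the three left-hand bins of the example above plus one extra bin whose values (25.0 and 27) are invented for illustration; the function name `rebin` and the `(start, stop)` group convention are also assumptions:

```python
def rebin(values, groups):
    """Sum `values` over each (start, stop) index range in `groups`; stop is exclusive."""
    return [sum(values[a:b]) for a, b in groups]

model = [1.2, 3.9, 10.8, 25.0]    # narrow-bin model counts (last value hypothetical)
counts = [3, 3, 8, 27]            # narrow-bin measured counts (last value hypothetical)
groups = [(0, 3), (3, 4)]         # merge the first three bins; keep the fourth as is

print(rebin(model, groups))       # first entry is 15.9, up to floating-point rounding
print(rebin(counts, groups))      # [14, 27]
```

As with truncation, the grouping would be chosen once from the data and held fixed during the parameter search.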


Other Histogram Fit Considerations 


Slightly correlated bin counts: Bin counts are binomially distributed (a measurement is either in a bin, 
or it’s not). However, there is a small negative correlation between any two bins, because the fact that a 
measurement lies in one bin means it doesn’t lie in any other bin. Recall that the $\chi^2$ parameter relies on
uncorrelated errors between bins, so a histogram slightly violates that assumption. With a moderate number
of bins (more than roughly 15), this is usually negligible.


Overestimating the low count model: If there are a lot of low-count bins in your histogram, you may 
find that the fit tends to overestimate the low-count bins, and underestimate the high-count bins (Figure 10.4). 
When properly normalized, the sum of overestimates and underestimates must be zero: the sum of the model 
predicted counts is fixed at n. But since low-count bins weigh more than high-count bins, and since an 
overestimated model reduces $\chi^2$ (the model value $\mathrm{model}_i$ appears in the denominator of each $\chi^2$ term), the
overall $\chi^2$ is reduced if low-count bins are overestimated, and high-count bins are underestimated.

[Figure: histogram with model PDF; the model is underestimated at the high-count central bins, and overestimated at the low-count tail bins.]


Figure 10.4 $\chi^2$ is artificially reduced by overestimating low-count bins, and underestimating high-count
bins.


This effect can only happen if your model has the freedom to “bend” in the way necessary: i.e., it can be 
a little high in the low-count regions, and simultaneously a little low in the high-count regions. Most realistic 
models have this freedom. This effect biases your model parameters. If the model is reasonably good, it can 
cause the reduced-$\chi^2$ to be consistently less than 1 (which should be impossible).


I don’t know of a simple fix for this. It helps to limit the weight of low-count bins to (say) 1, as described
above. However once again, a better approach is to minimize the number of low-count bins in your 
histogram. 


Noise not zero mean: In histograms, all bin counts are zero or positive. Any noise will add positive 
counts, and therefore noise cannot be zero-mean. If you know the PDF of the noise, then you can put it in 
the model, and everything should work out fine. However, if you have a lot of unmodeled noise, you should 
see that your reduced-$\chi^2$ is significantly greater than 1, indicating a poor fit. Some people have tried playing
with the denominator in the $\chi^2$ sum to try to get more “accurate” fit parameters in the presence of noise, but
there is little theoretical justification for this, and it lends itself to ad-hoc tweaking to get the answers you 
want. Better to model your noise, and be objective about it. 


Non-$\chi^2$ figure of merit: One does not have to use $\chi^2$ as the fit figure of merit. If the model is not very
good, or if there are problems as mentioned above, other FOMs might work better. The most common
alternative is probably “least-squares,” which means minimizing the sum-squared-residual:

$$\mathrm{SSE} = \sum_{i=1}^{N_{bins}} (c_i - \mathrm{model}_i)^2 \qquad \text{(sum-squared-residual)} .$$

This is like $\chi^2$ where the denominator in each term in the sum is always 1.


Data With a Hard Cutoff: When Zero Just Isn’t Enough 


[Figure: histogram with model curve; bin centers at 0, 25, 50, 75, and 100 ps; bin boundaries at 12.5, 37.5, 62.5 ps, etc.]


Figure 10.5 Binning data with a lower bound of zero creates a zero-bin of only half width. 


Consider a measured quantity with zero as an absolute lower bound. For example, the parameter might 
be the delay time from cause to effect. Suppose we measure it in ps, and (for some reason) we want to bin 
the measurements into 25 ps bins. Following the advice above, we’d put our bin centers at 0 ps, 25 ps, 50 
ps, etc., so our bin boundaries are 12.5 ps, 37.5 ps, 62.5 ps, etc. (Figure 10.5). However, in this case, the 
zero-bin is really only 12.5 ps wide. Therefore, when computing the model bin-count for the zero bin from
the model PDF, you must use only half the standard bin-width. That’s all there is to it. Despite this slight
accommodation, we think it’s still worth it to keep the bin centered at zero. 
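The half-width zero bin falls out naturally if the bin widths are computed by clipping each bin at the hard cutoff. A small Python sketch, in which the function name `bin_widths` and its arguments are hypothetical:

```python
def bin_widths(centers, spacing, lower_bound=0.0):
    """Width of each equally spaced bin, clipped at a hard lower bound on the data."""
    widths = []
    for x in centers:
        low = max(x - spacing / 2.0, lower_bound)   # clip at the cutoff
        high = x + spacing / 2.0
        widths.append(high - low)
    return widths

# Bin centers at 0, 25, 50 ps with 25 ps spacing, and zero as an absolute lower bound.
print(bin_widths([0.0, 25.0, 50.0], spacing=25.0))   # [12.5, 25.0, 25.0]
```

These widths are what would multiply the model PDF when computing the model bin-counts.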


Filtering and Data Processing for Equally Spaced Samples 


Equally spaced samples (e.g., in time or space) are often “filtered” in some way as part of data reduction 
or processing. We present here some general guidelines that usually give the best results. We recommend 
that before deviating from these guidelines, you clearly justify the need to do so, and discuss the justification 
in any presentation of your data analysis (e.g., in your paper or presentation). 


Finite Impulse Response Filters (aka Rolling Filters) and Boxcars 


Finite Impulse Response (FIR) filters take a sequence of input data, and produce a (slightly shorter) 
sequence of output data which is “filtered” or “smoothed” in a particular way. Their primary uses are: 


• In real-time processing (or a simulation of indefinite length), where an indefinitely long
sequence of data must be continuously processed.


• To crudely smooth data for a visual graph, where the smoothed data are not to be used for
further analysis.


Note that fitting procedures should usually be done on the original data, without any pre-filtering. Most 
fitting procedures inherently filter the noise, and pre-filtering usually degrades the results. Therefore, in data 
post-processing, where the entire data set is available at once, FIR filters (including “boxcar” filters) are 
rarely needed or used. 


FIR Example: Consider a sequence of data $u_j$, j = 1 ... n. In an FIR filter, the output sample at index i
is a weighted sum of nearby input samples (Figure 10.6). A simple, widely used filter is:

$$y_i = \tfrac{1}{4}\,u_{i-1} + \tfrac{1}{2}\,u_i + \tfrac{1}{4}\,u_{i+1} .$$

The coefficients, or taps, are the three weights 1/4, 1/2, and 1/4. This is a 3-tap filter. Most FIR filters will be
symmetric in their coefficients (same backwards as forwards), because asymmetric filters not only give an
erratic spectrum, but also introduce phase shifts that further distort the data.


[Figure: five consecutive input samples are each multiplied by a weight, and the products are summed to give one output sample.]


Figure 10.6 Example of a 5-tap FIR filter. 


Definition of FIR filter: We define l to be the distance to the farthest sample included in the weighted sum.
In the 3-tap filter example above, l = 1. By definition, an FIR filter produces outputs $y_j$ according to:

$$y_j = \sum_{k=-l}^{l} w_k\,u_{j+k}, \quad \text{where } w_k = \text{weights, or "taps"} .$$

The number of taps is 2l + 1. The weights are usually normalized so they sum to 1. We can think of an FIR
filter as sequencing through the index j. Almost all FIR filters require input samples both ahead of and behind the
current sample. Therefore, in real-time processing, an FIR filter introduces a delay of l samples before it
produces its output. This is usually benign.
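The definition translates almost directly into code. The following Python sketch (the function name `fir_filter` is hypothetical) produces an output only where a full set of neighbors exists, so the output is shorter than the input by 2l samples:

```python
def fir_filter(u, taps):
    """Apply y_j = sum over k = -l..l of w_k * u[j + k], wherever all neighbors exist."""
    if len(taps) % 2 != 1:
        raise ValueError("expected an odd number of taps")
    l = len(taps) // 2
    # taps[m] holds w_k with k = m - l, so the matching input sample is u[j + m - l].
    return [sum(w * u[j + m - l] for m, w in enumerate(taps))
            for j in range(l, len(u) - l)]

# The 3-tap filter with weights 1/4, 1/2, 1/4 applied to a single spike.
print(fir_filter([0, 0, 4, 0, 0], [0.25, 0.5, 0.25]))   # [1.0, 2.0, 1.0]
```

Since the weights sum to 1, a constant input passes through unchanged.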


FIR (and IIR) filters are linear, so Fourier analysis is appropriate.

Use Smooth Filters (not Boxcars)


“Boxcar” filters are a special case of FIR filters where all the coefficients are the same. 


Boxcar filters are rarely appropriate, and we discourage their use. 
Far better filters are given here, so you can use them effortlessly. 


A nice set of smooth filter coefficients turns out to be the odd rows of Pascal’s Triangle (which are also the 
binomial coefficients). Figure 10.7 shows, as an example, the nice quality of the frequency response for the 
9-tap filter. In the table below, the normalization factors are in parentheses before the integer coefficients: 


3 taps: (1/4) (1 2 1)
5 taps: (1/16) (1 4 6 4 1)
7 taps: (1/64) (1 6 15 20 15 6 1)
9 taps: (1/256) (1 8 28 56 70 56 28 8 1)
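Rather than typing the coefficients by hand, they can be generated from the binomial coefficients. A short Python sketch (the function name `smooth_taps` is hypothetical; `math.comb` requires Python 3.8 or later):

```python
import math

def smooth_taps(num_taps):
    """Row of Pascal's Triangle with `num_taps` entries, normalized to sum to 1."""
    row = num_taps - 1
    return [math.comb(row, k) / 2 ** row for k in range(num_taps)]

print(smooth_taps(3))                               # [0.25, 0.5, 0.25]
print([round(t * 256) for t in smooth_taps(9)])     # [1, 8, 28, 56, 70, 56, 28, 8, 1]
```

The normalization factor is always a power of 2, because the entries of row m of Pascal's Triangle sum to 2^m.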




Figure 10.7 (Left) Frequency response of a 9-tap smooth filter is well behaved. 
(Right) Frequency response of a 9-tap boxcar filter is erratic, and sometimes negative. 


(To be supplied) Be careful to reproduce the tap coefficients exactly, and don’t approximate with so- 
called “in-place” filters. Seemingly small changes to a filter can produce unexpectedly large degradations in 
behavior. 


Problems With Boxcar Filters 


Boxcar filters suffer from an erratic frequency response (Figure 10.7). This colors the noise, which is
sometimes harmful, and almost never useful. Furthermore, some frequencies are inverted, so they come out
the negative of how they went in (for the 9-tap boxcar, between f = 0.11 and 0.22, and between 0.33 and 0.44). Also, it’s easy to mistakenly use
an even number of taps in a boxcar, which makes the result even worse by introducing phase distortion.
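The sign inversions are easy to check numerically. For symmetric taps, the frequency response is real: H(f) is the sum of $w_k \cos(2\pi f k)$, since the sine terms cancel. The Python sketch below (the function name `response` and the sample frequencies are choices made here) evaluates it for the 9-tap boxcar and the 9-tap smooth filter. For the 9-tap boxcar this sum works out to sin(9πf) / (9 sin(πf)), with zero crossings at multiples of 1/9; for the 9-tap smooth filter it works out to cos^8(πf), which is never negative.

```python
import math

def response(taps, f):
    """Real frequency response H(f) of symmetric, odd-length taps; f in cycles/sample."""
    l = len(taps) // 2
    return sum(w * math.cos(2 * math.pi * f * (m - l)) for m, w in enumerate(taps))

boxcar9 = [1 / 9] * 9
smooth9 = [c / 256 for c in (1, 8, 28, 56, 70, 56, 28, 8, 1)]

# Columns: frequency, boxcar response, smooth-filter response.
for f in (0.0, 0.05, 0.15, 0.28, 0.40):
    print(f, round(response(boxcar9, f), 3), round(response(smooth9, f), 3))
```

The boxcar column goes negative at f = 0.15 and f = 0.40, while the smooth-filter column decreases steadily and stays non-negative.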
11. Fourier Transforms and Digital Signal Processing


Signals, noise, and Fourier Transforms are an essential part of much data analysis. It is a deep and broad 
subject, in which we can here establish only some foundational principles. The subject is, however, rife with 
misunderstandings and folklore. Therefore, we here also dispel some myths. For more specialized 
information, one must consult more specialized texts. 


This section assumes you are familiar with complex arithmetic and exponentials, and with basic 
sampling and Fourier Transform principles. In particular, you must be familiar with decomposing a function 
into an orthonormal basis of functions. Understanding that a Fourier Transform is a phasor-valued function 
of frequency is very helpful, but not essential (see Funky Electromagnetic Concepts for a discussion of 
phasors). 


We start with the most general (and simplest) case, then proceed through more specialized cases. We 
include some important (often overlooked) properties of Discrete Fourier Transforms. Topics: 


• Complex sequences, and complex Fourier Transform (it’s actually easier to start with the complex
case, and specialize to real numbers later)
• Sampling and the Model of Digitization
• Even number of points vs. odd number of points
• Basis Functions and Orthogonality
• Real sequences: even and odd # points
• Normalization and Parseval’s Theorem
• Continuous vs. discrete time and frequency; finite vs. infinite time and frequency
• Non-uniformly spaced samples


Brief Definitions 


Fourier Series represents a periodic continuous function as an infinite sum of sinusoids at discrete 
frequencies: 


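$$s(t) = \sum_{k=-\infty}^{\infty} S_k\,e^{i k \omega_1 t}, \quad \text{where } S_k \text{ are complex (phasors)}, \; t = \text{time},$$
$$f_1 = 1/\text{period} \text{ (in cycle/s or Hz)}, \quad \omega_1 = 2\pi f_1 \text{ (in rad/s)} .$$
$f_1$ = 1/period, the lowest nonzero frequency, is called the fundamental frequency. $f_0$ = 0, always.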
Fourier Transform (FT) 
represents a continuous function as an integral of sinusoids over continuous frequencies: 
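$$s(t) = \int_{-\infty}^{\infty} S(f)\,e^{i 2\pi f t}\,df = \frac{1}{2\pi}\int_{-\infty}^{\infty} S(\omega)\,e^{i\omega t}\,d\omega, \quad \text{where } S(\cdot) \text{ is complex} .$$

We do not discuss this here. The function s(t) is not periodic, so there is no fundamental frequency. $S(\omega)$ is
a phasor-valued function of angular frequency.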


Discrete Fourier Transform (DFT) 
represents a finite sequence of numbers as a finite sum of sinusoids: 


$$s_j = \sum_{k=0}^{n-1} S_k\,e^{i 2\pi (k/n) j},$$

where k = frequency index,
$S_k$ are complex (phasors),
j = 0, ... n - 1 = the sample index,
$f_1$ = 1/period (in cycle/s), $\omega_1 = 2\pi f_1$ (in rad/s).

The sequence $s_j$ may be thought of as either periodic, or undefined outside the sampling interval. As in the
Fourier Series, the fundamental frequency is 1/period, or equivalently 1/(sampled interval), and $f_0$ = 0, always.


[Since a DFT essentially treats the input as periodic, it might be better called a Discrete Fourier Series (rather 
than Transform), but Discrete Fourier Transform is completely standard. ] 


Fast Fourier Transform (FFT) 
an algorithm for implementing special cases of DFT. 


Inverse Discrete Fourier Transform (IDFT) 
gives the sequence of numbers $s_j$ from the DFT components.


Model of Digitization and Sampling 


All realistic systems which digitize analog signals must comprise at least the components in Figure 11.1. 


[Figure: analog signal → anti-alias Low Pass Filter (LPF) → filtered analog signal → Analog to Digital Converter (ADC), driven by the sample clock → digital samples, $s_j$.]


Figure 11.1 Minimum components of a Digital Signal Processing system, with uniformly spaced 
samples. 


In this example, the output of the digitizer is a sequence of real numbers, $s_j$. Other systems (such as coherent
quadrature downconverters) produce a sequence of complex numbers. 


Sampling Does Not Produce Impulses 


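It is often said that sampling is like setting the function to zero between samples, or like creating a series
of impulse functions.


These notions are not true, and can be misleading [O&S p8b]. Note that a single impulse (in time) has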
infinite power. Therefore, a sum (sequence) of such impulses also has infinite power. In contrast, the original 
signal, and the sequence of samples, has finite power. This suggests immediately that samples are not 
equivalent to a series of impulses. 


Nonetheless, there is an identity that involves impulse functions, which we discuss after introducing the 
DFT. 


Complex Sequences and Complex Fourier Transform 


It’s actually easier to start with the complex case, and specialize to real numbers later. Given a sequence
of n complex numbers $s_j$, we can write the sequence as a sum of sinusoids, i.e. complex exponentials:

Inverse Discrete Fourier Transform:

$$s_j = \sum_{k=0}^{n-1} S_k\,e^{i 2\pi (k/n) j},$$

where j = 0, ... n - 1 is the sample index,
k/n = the frequency of the k-th component, in cycle/sample,
$S_k$ = the complex frequency component (phasor).


Note that there are n original complex numbers, and n complex frequency components, so no information is 
lost. 


The Discrete Fourier Transform is exact, unique, and invertible. 


In other words, this is not a “fit.” 
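The exactness of the transform can be checked with a direct (slow) implementation. The Python sketch below follows the convention used here, in which each sample is the plain sum of its components, so the factor of 1/n goes on the forward transform; many library FFT routines (NumPy's default, for one) put it on the inverse instead. The function names `dft` and `idft` and the sample values are hypothetical.

```python
import cmath

I = 1j  # imaginary unit, named to avoid confusion with the sample index j

def dft(samples):
    """S_k = (1/n) * sum over j of s_j * exp(-i 2 pi (k/n) j)."""
    n = len(samples)
    return [sum(s * cmath.exp(-2 * cmath.pi * I * k * j / n)
                for j, s in enumerate(samples)) / n
            for k in range(n)]

def idft(components):
    """s_j = sum over k of S_k * exp(+i 2 pi (k/n) j): the plain sum of the components."""
    n = len(components)
    return [sum(S * cmath.exp(2 * cmath.pi * I * k * j / n)
                for k, S in enumerate(components))
            for j in range(n)]

samples = [1 + 2j, -1j, 3, 0.5 - 0.5j]
recovered = idft(dft(samples))
max_error = max(abs(a - b) for a, b in zip(samples, recovered))
print(max_error)   # expected to be at floating-point rounding level
```

The round trip recovers the original samples to within rounding error, with no fitting or optimization involved.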


The above equation forces all normalization conventions. We use the simple scheme wherein a
function equals the sum of its components (with no factors of 2π or anything else).


Often, the index j is a measure of time or distance, and the sequence comprises samples of a signal taken 
at equal intervals. Without loss of generality, we will refer to j as a measure of “time,” but it could be 
anything. Note that the equation above actually defines the Inverse Discrete Fourier Transform (IDFT), 
because it gives the original sequence from the Fourier components. [Mathematicians often reverse the 
definitions of DFT and IDFT, by putting a minus sign in the exponent of the IDFT equation above. Engineers 
and physicists usually use the convention given here. ] 


The sample index j, and the frequency index k, are dimensionless. Then the frequency amplitudes have 
the same units as the samples: $[S_k] = [s_j]$. If we think of the sample index as time, then the frequencies are in
rad/s. 


Each number in the sequence is called a sample, because such sequences are often generated by sampling 
a continuous signal s(t). For n samples, there are n frequency components, $S_k$, each at normalized frequency
k/n (defined soon); see Figure 11.2. 


[Figure: (left) complex samples $s_j$, j = 0 ... 9, over the signal period (aka sample interval), with the basis-frequency sinusoids including the fundamental frequency; (right) complex frequency components $S_k$, k = 0 ... 9.]


Figure 11.2 Samples in time, and their frequencies. For simplicity, the samples, sinusoids, and 
component amplitudes are shown as real, but in general, they are all complex valued. 


Note that there are a full n sample times in the sample interval (aka signal period), not (n — 1). 


The above representation is used by many DFT functions in computer libraries. 


Also, there is no need for any other frequencies. With the n = 10 samples shown, k = 10 has exactly the same values at all the
sample points as k = 0. If the samples are from a continuous signal that had a frequency component at k = 10,
then that component will be aliased down to k = 0, and added to the actual k = 0 component. It is forever
lost, and cannot be recovered from the samples, nor distinguished from the k = 0 (DC) component. The same
aliasing occurs for any two frequencies k and k + n.
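A quick numerical check of the aliasing claim, as a Python sketch (the function name `basis_samples` is hypothetical): the basis function at frequency index k + n takes the same values at every sample point as the one at index k.

```python
import cmath

def basis_samples(k, n):
    """Values of the k-th DFT basis function exp(i 2 pi (k/n) m) at sample indices m = 0..n-1."""
    return [cmath.exp(2j * cmath.pi * k * m / n) for m in range(n)]

n = 10
for k in (0, 3):
    original = basis_samples(k, n)
    alias = basis_samples(k + n, n)
    # The largest difference between the two sets of sample values.
    print(k, max(abs(a - b) for a, b in zip(alias, original)))
```

The printed differences are at floating-point rounding level, so the samples alone cannot distinguish k from k + n.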
To avoid a dependence on n, we usually label the frequencies as fractions. For n samples, there are n
frequencies, measured in units of cycles/sample, and running from f = 0 to f = (1 - 1/n) cycles/sample. The
n normalized frequencies are:

$$f_k = \frac{k}{n}, \quad k = 0, 1, \ldots n-1, \quad \text{that is} \quad \{f_k\} = \left\{0, \frac{1}{n}, \frac{2}{n}, \frac{3}{n}, \ldots, \frac{n-1}{n}\right\} .$$

There is no f = 1, just as there is no k = n, because f = 1 is an alias of f = 0. Continuous Fourier components
are written as S(f), a function of f, so we re-label the above diagram with normalized frequencies:


[Figure: (left) complex samples for n = 10, j = 0 ... 9, over the sampled interval; (right) complex frequency components S(f) at normalized frequencies f = 0, 0.1, 0.2, ... 0.9.]


Figure 11.3 Samples in time, and their normalized frequencies. For simplicity, all values are shown 
as real, but in general, are complex. 


For theoretical analysis, it is often more convenient to have the frequency range be -0.5 < f ≤ 0.5, instead
of 0 ≤ f < 1. Since any frequency f is equivalent to (an alias of) f - 1, we can simply move the frequencies in
the range 0.5 < f < 1 down to -0.5 < f < 0:


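[Figure: complex frequency components S(f) for an even number of samples, shown (left) with frequencies from 0 up to just below 1, and (right) with the upper frequencies moved down to negative frequencies.]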


Figure 11.4 Even number of samples: two choices for the frequency set. The set including negative 
frequencies (right) is asymmetric. 


Note that the lowest frequency is -0.4, and the highest is +0.5. For an even number of samples (and
frequencies, diagram above), the resulting frequency set is necessarily asymmetric, because there is no f =
-0.5, but there is an f = +0.5.


For an odd number of points (below), the frequency set is symmetric, and there is neither f = -0.5 nor f
= +0.5:

[Figure: complex frequency components S(f) for n = 5, shown (left) with non-negative frequencies only, and (right) with negative frequencies.]


Figure 11.5 Odd number of samples: two choices for the frequency set. The set including negative 
frequencies (right) is symmetric. 
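

The two frequency layouts can be generated directly. This is a minimal sketch assuming NumPy; the helper name `normalized_frequencies` is hypothetical. It follows the convention used here, in which an even n keeps f = +0.5 and has no f = −0.5.

```python
import numpy as np

def normalized_frequencies(n, centered=False):
    """Return the n normalized DFT frequencies f = k/n, in cycles/sample.

    centered=False: 0 <= f < 1.
    centered=True:  frequencies above 0.5 are replaced by their alias f - 1,
                    giving -0.5 < f <= 0.5, sorted in increasing order.
    """
    k = np.arange(n)
    f = k / n
    if centered:
        # The integer test 2k > n avoids floating-point trouble exactly at f = 0.5
        f = np.sort(np.where(2 * k > n, f - 1.0, f))
    return f

f_even = normalized_frequencies(10, centered=True)   # asymmetric set: -0.4 ... +0.5
f_odd = normalized_frequencies(5, centered=True)     # symmetric set:  -0.4 ... +0.4
print(f_even)
print(f_odd)
```

Note that `numpy.fft.fftfreq` makes the opposite choice for even n: it labels the k = n/2 component as f = −0.5 rather than +0.5. The two labels are aliases of the same component.
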


Non-Equivalence of DFT and FT of Series of Time-Domain Impulses (Again) 


As noted earlier, it is often said that sampling is like setting the function to zero between samples, or 
creating a series of impulse functions. This is a common misconception. It is well refuted by [Oppenheim 
and Schafer p??, and dozens of other signal processing experts]. It is easy to show that the claim is not true, 
in several ways. One simple way is this: for a band-limited signal, I can reconstruct the signal between the 
sample times from the samples alone. That makes no sense if sampling amounted to zeroing the signal 
between samples, because that would be a new function, which would destroy information about the original. 
There would then be no way to recreate the original function from its DFT. 


Furthermore, it is often said that the FT of a series of time-domain impulses is identical to the DFT of 
samples at those times. From the previous paragraph, this cannot be true, either. However, the following is 
true only at the (integer) k defined DFT frequencies of $k\omega_1$: 


$$ S(\omega_k) = S(k\omega_1) = \frac{1}{n} \sum_{j=0}^{n-1} s_j\, e^{-ik\omega_1 t_j} = \frac{1}{n} \int \left( \sum_{j=0}^{n-1} s_j\, \delta(t - t_j) \right) e^{-ik\omega_1 t}\, dt, $$


where $\omega_1$ = fundamental frequency. 


For frequencies in between the $k\omega_1$, the DFT is formally undefined, but can be taken as zero for the purpose 
of reconstructing the original samples. However, at those in-between frequencies the FT has some non-zero 
values which are usually of little interest. So we see that the FT of a series of weighted impulses (representing 
the sample values), evaluated at the DFT frequencies $k\omega_1$, is proportional to the DFT, but the full spectrum 
of the FT is different from the spectrum of the DFT. Hence, the two transformations are not equivalent. 


Basis Functions and Orthogonality 


The basis functions of the DFT are the discrete-time exponentials, which are equivalent to sines and 
cosines: 


$$ b_k(j) = e^{i(2\pi k/n)j} $$


where   j = sample index = 0, 1, ... n−1, 
        k = frequency index = 0, 1, ... n−1, or 
            n even: −n/2+1, ... −1, 0, 1, ... n/2 
            n odd:  −int(n/2), ... −1, 0, 1, ... int(n/2). 
Note that: 


The DFT and FT are simply decompositions of functions into basis functions, 


just like in ordinary quantum mechanics. The transform equations are just the inner products of the 
given functions with the basis functions. 


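The basis functions are orthogonal, normalized (in our convention) such that $\langle b_k | b_m \rangle = n\,\delta_{km}$. Proof: 


$$ \langle b_k | b_m \rangle = \sum_{j=0}^{n-1} b_k^*(j)\, b_m(j) = \sum_{j=0}^{n-1} e^{-i(2\pi k/n)j}\, e^{i(2\pi m/n)j} = \sum_{j=0}^{n-1} e^{i(2\pi/n)(m-k)j} = \sum_{j=0}^{n-1} \left[ e^{i(2\pi/n)(m-k)} \right]^j . $$


For k = m, we have 

$$ \langle b_k | b_m \rangle = \sum_{j=0}^{n-1} \left[ e^{0} \right]^j = n . $$

For k ≠ m, use 

$$ \sum_{j=0}^{n-1} r^j = \frac{1 - r^n}{1 - r}, \qquad \text{where } r \equiv e^{i(2\pi/n)(m-k)} . $$

Then: 

$$ \langle b_k | b_m \rangle = \frac{1 - \left[ e^{i(2\pi/n)(m-k)} \right]^n}{1 - r} = \frac{1 - e^{i2\pi(m-k)}}{1 - r} = \frac{1 - 1}{1 - r} = 0 . $$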
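

The orthogonality relation can also be confirmed numerically. A minimal sketch, assuming NumPy and an arbitrary illustrative n = 8:

```python
import numpy as np

n = 8                                    # illustrative sample count
idx = np.arange(n)                       # sample index j = 0 ... n-1

# Row k of B holds the basis function b_k(j) = exp(i (2 pi k / n) j)
B = np.exp(2j * np.pi * np.outer(np.arange(n), idx) / n)

# Gram matrix of inner products <b_k | b_m> = sum_j conj(b_k(j)) b_m(j)
gram = B.conj() @ B.T

# Orthogonal, with norm-squared n: the Gram matrix is n times the identity
is_orthogonal = np.allclose(gram, n * np.eye(n))
print(is_orthogonal)
```
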


The orthogonality condition allows us to immediately write the DFT from the definition of the IDFT 
above: 


Discrete Fourier Transform: 


$$ S_k = \frac{1}{n} \sum_{j=0}^{n-1} s_j\, e^{-i(2\pi k/n)j}, $$


where   $S_k$ = the kth complex frequency component, 
        k/n = the normalized frequency of the kth component. 


Note that there are 2n independent real numbers in the complex sequence $s_j$, and there are also 2n independent 
real numbers in the complex spectrum $S_k$, as there must be (same number of degrees of freedom). 
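

As a concrete check of this normalization, here is a minimal sketch (assuming NumPy) that evaluates the defining sums directly and compares them with `numpy.fft.fft`, which puts no factor on the forward transform. The function names `dft` and `idft` are hypothetical.

```python
import numpy as np

def dft(s):
    """Forward DFT with the 1/n factor: S_k = (1/n) sum_j s_j exp(-i 2 pi k j / n)."""
    s = np.asarray(s, dtype=complex)
    n = len(s)
    idx = np.arange(n)
    kernel = np.exp(-2j * np.pi * np.outer(idx, idx) / n)   # kernel[k, j]
    return kernel @ s / n

def idft(S):
    """Inverse DFT with no prefactor: s_j = sum_k S_k exp(+i 2 pi k j / n)."""
    S = np.asarray(S, dtype=complex)
    n = len(S)
    idx = np.arange(n)
    kernel = np.exp(2j * np.pi * np.outer(idx, idx) / n)    # kernel[j, k]
    return kernel @ S

rng = np.random.default_rng(0)
s = rng.normal(size=7) + 1j * rng.normal(size=7)   # n = 7: any length works
S = dft(s)
```

The direct sums cost O(n²) operations; they are written this way only to mirror the definition.
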


Real Sequences 


An important case of sequence $s_j$ is a real-valued sequence, which is a special case of a complex-valued 
sequence. In this section, we use the positive and negative frequency DFT form, where k takes on both 
negative and positive integer values. Then for 


$$ s_j = \sum_{k \approx -n/2}^{\approx n/2} S_k\, e^{i2\pi(k/n)j} $$


to be real, the $S_k$ must occur in complex conjugate pairs, i.e., the spectrum $S_k$ must be conjugate symmetric: 


$$ S_{-k} = S_k^{*} \qquad \text{for } s_j \text{ real, and } k < \mathrm{int}(n/2) + 1 . $$


This implies that $S_0$ is always real, which is also clear since $S_0$ is just the average of the real sequence. 


We now discuss the lower limit for k. (As discussed earlier, there is no k = −n/2.) There are n 
independent real numbers in the real sequence $s_j$. We now consider two subcases: n is even, and n is odd. 


For n even, 

$$ s_j = \sum_{k=-n/2+1}^{n/2} S_k\, e^{i2\pi(k/n)j}, \qquad n \text{ even},\ s_j \text{ real}, $$

and we use the asymmetric frequency range −0.5 < f ≤ 0.5, which corresponds to −n/2 + 1 ≤ k ≤ n/2 (Figure 
11.6, left). For an even number of sample points, since there are no conjugates to k = 0 or k = n/2, we must 
have that $S_0$ and $S_{n/2}$ are real (actually, $S_0$ being real still satisfies conjugate symmetry). All other $S_k$ may be 
complex, and are conjugate symmetric: $S_{-k} = S_k^{*}$. 
[Figure: complex frequency components S(f). (Left) n = 10, f = −0.4, −0.3, ... 0.4, 0.5: $S_0$ and $S_{0.5}$ are real, 
and the rest are conjugate symmetric. (Right) n = 9, f = −0.44, −0.33, ... 0.33, 0.44: $S_0$ is real, and the rest are 
conjugate symmetric.] 


Figure 11.6 (Left) Sequence with even number of samples, n = 10. (Right) Sequence with odd 
number, n = 9. 


Therefore, in the spectrum, there are (n/2 − 1) independent complex frequency components, plus two real 
components, totaling n independent real numbers in the spectrum, matching the n independent real numbers 
in the sequence $s_j$. 
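

These symmetry properties for even n are easy to confirm on a random real sequence. A minimal sketch, assuming NumPy and its `numpy.fft.fft` (rescaled by 1/n to match the convention used here):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10                                   # even number of samples
s = rng.normal(size=n)                   # real sequence
S = np.fft.fft(s) / n                    # 1/n on the forward transform

# Index -k names the same component as index n - k (they are aliases), so S[-k] is S_{-k}.
pairs_conjugate = all(np.isclose(S[-k], np.conj(S[k])) for k in range(1, n // 2))
dc_is_real = bool(np.isclose(S[0].imag, 0.0))
nyquist_is_real = bool(np.isclose(S[n // 2].imag, 0.0))
dc_is_mean = bool(np.isclose(S[0].real, s.mean()))
print(pairs_conjugate, dc_is_real, nyquist_is_real, dc_is_mean)
```
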


In terms of sine and cosine components (rather than the complex components), there are (n/2 + 1) 
independent cosine components, and (n/2 − 1) independent sine components. All frequencies are 
nonnegative: 


$$ s_j = A_0 + \sum_{k=1}^{n/2-1} \left\{ A_k \cos\left[ 2\pi(k/n)j \right] + B_k \sin\left[ 2\pi(k/n)j \right] \right\} + A_{n/2} \cos \pi j, \qquad n \text{ even},\ s_j \text{ real}. \qquad (11.1) $$


Note that in the last term, cos πj is just an alternating sequence of +1, −1, +1, .... 
For an odd number of points (Figure 11.6, right), 

$$ s_j = \sum_{k=-(n-1)/2}^{(n-1)/2} S_k\, e^{i2\pi(k/n)j}, \qquad n \text{ odd}, $$

there is no k = n/2 component, and again there is no conjugate to k = 0. Therefore, we must have that $S_0$ is 
real. All other $S_k$ are conjugate symmetric. Therefore, in the spectrum, there are (n − 1)/2 independent 
complex frequency components, plus one real component ($S_0$), totaling n independent real numbers in the 
spectrum, matching the n independent real numbers in the sequence $s_j$. 


In terms of sine and cosine components (rather than the complex components), there are (n + 1)/2 
independent cosine components, and (n − 1)/2 independent sine components. All frequencies are 
nonnegative: 


$$ s_j = A_0 + \sum_{k=1}^{(n-1)/2} \left\{ A_k \cos\left[ 2\pi(k/n)j \right] + B_k \sin\left[ 2\pi(k/n)j \right] \right\}, \qquad n \text{ odd},\ s_j \text{ real}. \qquad (11.2) $$


Note that there is no final lone-cosine term, and no alternating sequence. 

These examples illustrate how the notation is slightly more involved for the cosine/sine form than for 
the complex exponential form. 


Normalization and Parseval’s Theorem 


When the original sequence represents something akin to samples of voltage over time, we sometimes 
speak of “energy” in the signal. The energy of the signal is the sum of the energies of each sample: 


$$ E = \sum_{j=0}^{n-1} G\, |s_j|^2 = \sum_{j=0}^{n-1} |s_j|^2, \qquad \text{where } G \equiv \text{“conductance”, chosen to be 1.} $$


When the “conductance” is chosen to be 1, or some other reference value, the “energy” in the signal does not 
correspond to any physical energy (say, in joules). 


The energies of the sinusoidal components in the DFT add as well, because the sinusoids are orthogonal 
(show why??): 


Parseval’s Theorem equates the energy of the original sequence to the energy of the sinusoidal components, 
by providing the constant of proportionality. First, we evaluate the energy of a single sinusoid: 


$$ E_k = |S_k|^2 \sum_{j=0}^{n-1} \left| e^{i(2\pi k/n)j} \right|^2 = n\,|S_k|^2, \qquad \text{where } k = \text{frequency index} = 0, 1, \ldots n-1 . $$

Then summing over all frequencies yields: 

$$ E = \sum_{k=0}^{n-1} E_k = n \sum_{k=0}^{n-1} |S_k|^2 \qquad \Rightarrow \qquad n \sum_{k=0}^{n-1} |S_k|^2 = \sum_{j=0}^{n-1} |s_j|^2 . \qquad (11.3) $$
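

A quick numerical check of (11.3), as a minimal sketch assuming NumPy and the 1/n forward normalization used here:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 12
s = rng.normal(size=n) + 1j * rng.normal(size=n)   # arbitrary complex sequence
S = np.fft.fft(s) / n                              # S_k with the 1/n forward convention

energy_time = np.sum(np.abs(s) ** 2)               # sum_j |s_j|^2
energy_freq = n * np.sum(np.abs(S) ** 2)           # n * sum_k |S_k|^2
print(energy_time, energy_freq)
```
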


Energy For Real Sequences: We derive Parseval’s Theorem for real sequences in two ways. Since 
the $s_j$ are real, the interesting form is the cosine/sine form of DFT, (11.1) and (11.2). We again consider 
separately the cases of n even, and n odd. 


First, for n even, k runs from 0 to n/2. We can deduce the equation for Parseval’s Theorem by 
exploiting the conjugate symmetry of the $S_k$. Recall that $S_0$ has no conjugate term, nor does $S_{n/2}$. Therefore: 


$$ E_0 = n A_0^2, \qquad E_{n/2} = n \left( A_{n/2} \right)^2 . $$

For k = 1, ... n/2 − 1, we have: 

$$ A_k = 2\,\mathrm{Re}\{S_k\}, \quad B_k = -2\,\mathrm{Im}\{S_k\}, \quad |S_k|^2 + |S_{-k}|^2 = 2\,\mathrm{Re}\{S_k\}^2 + 2\,\mathrm{Im}\{S_k\}^2 \quad \Rightarrow \quad E_k = \frac{n}{2} \left( A_k^2 + B_k^2 \right), \quad k = 1, \ldots n/2 - 1 . $$


We can derive this another way directly from the definition (11.1). Since $A_0$ is a constant added to each 
$s_j$, the energy contributed from this term is $E_0 = n A_0^2$. Since cos πj is just alternating +1, −1, ..., its energy at 
each sample is 1, and $E_{n/2} = n (A_{n/2})^2$. Finally, the average value of cos² over a full period is ½, as is the 
average of sin². Therefore, for k = 1, ... n/2 − 1, $E_k = (n/2)(A_k^2 + B_k^2)$. 


Second, for n odd, k runs from 0 to (n — 1)/2. The above arguments still apply, but there is no lone- 
cosine term at the end. Therefore the result is the same, without the lone-cosine term. 


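Summarizing: 


$$ n \text{ even:} \qquad E = n A_0^2 + n \left( A_{n/2} \right)^2 + \frac{n}{2} \sum_{k=1}^{n/2-1} \left( A_k^2 + B_k^2 \right) $$

$$ n \text{ odd:} \qquad E = n A_0^2 + \frac{n}{2} \sum_{k=1}^{(n-1)/2} \left( A_k^2 + B_k^2 \right) $$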
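

The summary can be checked for both an even and an odd length. The sketch below assumes NumPy; the helper names are hypothetical. It obtains the cosine/sine amplitudes from the complex components as A_k = 2 Re{S_k} and B_k = −2 Im{S_k} (the sign on B_k follows from expanding S_k e^{iθ} plus its conjugate), rebuilds the sequence from (11.1) or (11.2), and compares energies.

```python
import numpy as np

def cosine_sine_amplitudes(s):
    """Return (A, B) such that s_j = sum_k {A_k cos(2 pi k j / n) + B_k sin(2 pi k j / n)}.

    s must be real. k runs over 0 ... floor(n/2); B_0 (and B_{n/2} for n even) are zero.
    """
    s = np.asarray(s, dtype=float)
    n = len(s)
    S = np.fft.rfft(s) / n          # S_k for k = 0 ... floor(n/2), 1/n forward convention
    A = 2.0 * S.real
    B = -2.0 * S.imag
    A[0] = S[0].real                # the k = 0 term has no conjugate partner
    B[0] = 0.0
    if n % 2 == 0:
        A[-1] = S[-1].real          # nor does the k = n/2 term
        B[-1] = 0.0
    return A, B

def energy_from_amplitudes(A, B, n):
    """Energy from the amplitudes: lone terms are weighted by n, paired terms by n/2."""
    weights = np.full(len(A), n / 2.0)
    weights[0] = n
    if n % 2 == 0:
        weights[-1] = n
    return np.sum(weights * (A ** 2 + B ** 2))

rng = np.random.default_rng(3)
results = {}
for n in (10, 9):                    # one even and one odd length
    s = rng.normal(size=n)
    A, B = cosine_sine_amplitudes(s)
    phase = 2 * np.pi * np.outer(np.arange(len(A)), np.arange(n)) / n   # phase[k, j]
    rebuilt = A @ np.cos(phase) + B @ np.sin(phase)
    results[n] = (bool(np.allclose(rebuilt, s)),
                  bool(np.isclose(energy_from_amplitudes(A, B, n), np.sum(s ** 2))))
print(results)
```
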
Other normalizations: Besides our normalization choice above, there are several other choices in 
common use. In general, between the DFT, IDFT, and Parseval’s Theorem, you can choose a normalization 
for one, which then fixes the normalization for the other two. For example, some people choose to make the 
DFT and IDFT more symmetric by defining: 


$$ \text{DFT:} \quad S_k = \frac{1}{\sqrt{n}} \sum_{j=0}^{n-1} s_j\, e^{-i(2\pi k/n)j}, \qquad \text{IDFT:} \quad s_j = \frac{1}{\sqrt{n}} \sum_{k=0}^{n-1} S_k\, e^{i(2\pi k/n)j} $$

$$ \Rightarrow \qquad \sum_{k=0}^{n-1} |S_k|^2 = \sum_{j=0}^{n-1} |s_j|^2 \qquad \text{(alternate normalizations)} . $$
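

For reference, NumPy exposes this symmetric choice through the `norm="ortho"` argument of `numpy.fft.fft` and `numpy.fft.ifft`. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
s = rng.normal(size=9) + 1j * rng.normal(size=9)

S_sym = np.fft.fft(s, norm="ortho")        # forward transform scaled by 1/sqrt(n)
s_back = np.fft.ifft(S_sym, norm="ortho")  # inverse transform also scaled by 1/sqrt(n)

round_trip_ok = bool(np.allclose(s_back, s))
# With the symmetric normalization, Parseval's Theorem needs no extra factor of n
energies_equal = bool(np.isclose(np.sum(np.abs(S_sym) ** 2), np.sum(np.abs(s) ** 2)))
print(round_trip_ok, energies_equal)
```
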


Continuous and Discrete, Finite and Infinite 


TBS: Finite length implies discrete frequencies; infinite length implies continuous frequencies. Discrete 
time implies finite frequencies; continuous time implies infinite frequencies. Finite length is equivalent to 
periodic. 


White Noise, Correlation, and Gaussianity 


White noise has, on average, all frequency components equal (named by incorrect analogy with white 
light); samples of white noise are uncorrelated. Non-white noise has unequal frequency components (on 
average); samples of non-white noise are necessarily correlated. (Show this??). 


White noise may or may not be gaussian. Gaussian noise may or may not be white. 


For example, shot noise (aka “impulse noise”) is white, but not at all gaussian. Pink noise may be gaussian, 
but is not white. Gaussian noise, after being sent through a linear filter, is still gaussian (the weighted sum 
of any number of gaussian RVs is still gaussian). Clearly though, the noise after filtering has the frequency 
transfer function of the filter impressed on it, and is not white. 


Why Oversampling Does Not Improve Signal-to-Noise Ratio 


Sometimes it might seem that if I oversample a signal (i.e., sample above the Nyquist rate), the noise 
power stays constant (= noise variance is constant), but I have more samples of the signal which I can average. 
Therefore, by oversampling, I should be able to improve my SNR by averaging out more noise, but keeping 
all the signal. 


This reasoning is wrong, of course, because it implies that by sampling arbitrarily fast, I can filter out 
arbitrarily large amounts of noise, and ultimately recover anything from almost nothing. So what’s wrong 
with this reasoning? Let’s take an example. 


Suppose I sample a signal at 100 samples/sec, with white noise. Then my Nyquist frequency is 50 Hz, 
and I must use a 50 Hz Low Pass Filter (LPF) for anti-aliasing before sampling. This LPF leaves me with 
50 Hz worth of noise power (= variance). 


Now suppose I double the sampling frequency to 200 samples/sec. To maintain white noise, I must open 
my anti-alias filter cutoff to the new Nyquist frequency, 100 Hz. This doubles my noise power. Now I have 
twice as many samples of the signal, with twice as much noise power. I can run a LPF to reduce the noise 
(say, averaging adjacent samples). At best, I cut the noise by half, reducing it back to its 100 sample/sec 
value, and reducing my sample rate by 2. Hence, I’m right back where I was when I just sampled at 100 
samples/sec in the first place. 


4/27/2021 11:49AM — Copyright 2002-2021 Eric L. Michelsen. All rights reserved. 185 of 322 


elmichelsen.physics.ucsd.edu/ | Funky Mathematical Physics Concepts emichels at physics.ucsd.edu 


[Figure: three noise spectra, each plotted up to its Nyquist frequency: (left) discrete white noise spectrum, 
f_samp = 100 samples/sec; (middle) discrete white noise spectrum, f_samp = 200 samples/sec; (right) discrete 
correlated noise spectrum, f_samp = 200 samples/sec.] 


But wait! Why open my anti-alias LPF? Let’s try keeping the LPF at 50 Hz, and sampling at 200 
samples/sec. But then, my noise occupies only ½ of the sampling bandwidth: it occupies only 50 Hz of the 
100 Hz Nyquist band. Hence, the noise is not white, which means adjacent noise samples are correlated! 
Hence, when I average adjacent samples, the noise variance does not decrease by a factor of 2. The factor of 
2 gain only occurs with uncorrelated noise. In the end, oversampling buys me nothing. 
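

The effect of correlation on averaging can be simulated. This is a rough sketch assuming NumPy; the brick-wall FFT filter stands in for the anti-alias LPF purely for simplicity (a real analog filter rolls off gradually), and the sample count and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1 << 16                                  # number of noise samples (illustrative)
white = rng.normal(size=n)                   # uncorrelated (white) noise at the fast sample rate

# Idealized anti-alias filter left at the old cutoff: remove everything above half the
# Nyquist frequency, i.e. f > 0.25 cycles/sample.
spectrum = np.fft.rfft(white)
f = np.fft.rfftfreq(n)                       # 0 ... 0.5 cycles/sample
spectrum[f > 0.25] = 0.0
band_limited = np.fft.irfft(spectrum, n)     # correlated (non-white) noise

def pair_average_variance_ratio(x):
    """Variance after averaging adjacent sample pairs, relative to the input variance."""
    averaged = 0.5 * (x[0::2] + x[1::2])
    return averaged.var() / x.var()

ratio_white = pair_average_variance_ratio(white)               # close to 1/2
ratio_correlated = pair_average_variance_ratio(band_limited)   # noticeably above 1/2
print(ratio_white, ratio_correlated)
```

Averaging pairs of white-noise samples halves the variance, while the same averaging applied to the band-limited noise removes much less, because adjacent samples are positively correlated.
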


Filters TBS?? 


FIR vs. IIR. Because the data set can be any size, and arbitrarily large: 


The transfer function of an FIR or IIR is continuous. 


Consider some filter. We must carefully distinguish between the filter in general, which can be applied 
to any data set (with any n), and the filter as applied to one particular data set. Any given data set has only 
discrete frequencies; if we apply the filter to the data set, the data set’s frequencies will be multiplied by the 
filter’s transfer function at those frequencies. But we can apply any size data set to the filter, with frequency 
components, f= k/n, anywhere in the Nyquist interval. For every data set, the filter has a transfer function at 
all its frequencies. Therefore, the filter in general has a continuous transfer function. 


[Figure: three plots of H(f) vs. f, each over 0 ≤ f ≤ 0.5.] 


Figure 11.7 Data sets with different n sample the transfer function H(f) at different points. H(f), in 
general, is a continuous curve, defined at all points in the Nyquist interval f ∈ [0, 1) or (−0.5, +0.5]. 
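

To make this concrete, the sketch below (assuming NumPy) uses a hypothetical two-tap FIR filter, the average of adjacent samples, and assumes circular (periodic) filtering of each data set. It evaluates the transfer function at arbitrary f, then confirms that data sets of several sizes each see gains lying on that same continuous curve.

```python
import numpy as np

h = np.array([0.5, 0.5])          # hypothetical FIR filter: average of two adjacent samples

def transfer_function(h, f):
    """H(f) = sum_m h_m exp(-i 2 pi f m), defined for any real f in cycles/sample."""
    m = np.arange(len(h))
    f = np.atleast_1d(np.asarray(f, dtype=float))
    return np.exp(-2j * np.pi * np.outer(f, m)) @ h

def gains_seen_by_data_set(h, n):
    """Filter a length-n unit impulse circularly. Return the data set's frequencies k/n
    and the ratio of output to input DFT components at those frequencies."""
    x = np.zeros(n)
    x[0] = 1.0
    y = sum(h_m * np.roll(x, m) for m, h_m in enumerate(h))   # circular convolution
    return np.arange(n) / n, np.fft.fft(y) / np.fft.fft(x)

matches = []
for n in (8, 10, 25):
    f_n, gains = gains_seen_by_data_set(h, n)
    matches.append(bool(np.allclose(gains, transfer_function(h, f_n))))
print(matches)
```
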


What Happens to a Sine Wave Deferred? 


“.. Maybe it just sags, like a heavy load. Or does it explode?” [Sincere apologies to Langston Hughes. ] 
You may ask, “The DFT has only a finite set of basis frequencies. Can I use a DFT to estimate the frequency 
of an unknown signal? What happens if I sample a sinusoid of a frequency in between the chosen DFT basis 
frequencies? What is its spectrum?” Good questions. We now demonstrate. The results here are important 
for generalizing the DFT, and spectral analysis in general, to non-uniformly sampled signals. 


We choose n = 40 samples, which means the basis frequencies are k(1/n), k = −19, ... 0, ... 20, measured 
in cycles per sample (or equivalently, in units of the sampling rate, f_samp). The frequency spacing is 1/n = 
0.025 cycle/sample. No other frequencies exist in the DFT. 


First, we show the result of sampling an existing-frequency sinusoid of f = 10/n = 0.25 cycle/sample (k 
= 10). Since the signal is real, the spectrum is conjugate symmetric ($S_{-k} = S_k^{*}$); therefore, I show only the 
positive frequencies, and double their magnitudes: 


$$ s_j = \cos\left( 2\pi \frac{10}{n} j \right), \qquad f = \frac{10}{n} = 0.25 \text{ cycle/sample} . $$


Figure 11.8 (Left) A sampled sinusoid of f= 0.25, n = 40. (Right) As expected, its magnitude 
spectrum (DFT) has exactly one component at f= 0.25, with magnitude 1.0. 


[Notice that when the sample points are connected by straight segments, the sinusoid doesn’t “look” sinusoidal, 
but recall that connecting with straight segments is not the proper way to interpolate between samples. ] 


The “energy” of the sample set is exactly (1/2)40 = 20, because there is an integral number of cycles in 
the sample set, and the average energy of a sinusoid is ½. 


Now we take our signal off-frequency by half the frequency spacing: f= 10.5/n = 0.2625 cycle/sample, 
halfway between two chosen basis frequencies: 


(Left) A sampled sinusoid of f = 0.2625, n = 40. (Right) Its magnitude spectrum (DFT) has 
components everywhere, but is peaked around f= 0.2625. 


Not too surprisingly, the components are peaked at the two basis frequencies closest to the sinusoid 
frequency, but there are also components at all other frequencies. This is an artifact of sampling a pure 
sinusoid of a non-basis frequency for a finite time. Note also that the total energy in the sampled signal is 
slightly larger than that of the f = 0.25 signal, even though they are both the same amplitude. This is due to 
a few more of the samples being near a peak of the signal. This shift in total energy is another artifact of 
sampling a non-basis frequency. For other signal frequencies, or other time shifts, the energy could just as 
well be lower in the sampled signal. This energy shift also explains why the two largest components of the 
spectrum are not exactly equal, even though they are equally distant from the true signal frequency of f = 
0.2625. 


Finally, instead of being half-way between allowed frequencies, suppose we’re only 0.2 of the way, f= 
10.2/n = 0.255 cycle/sample: 
(Left) A sampled sinusoid of f = 0.255, n = 40. (Right) Its magnitude spectrum (DFT) has 
components everywhere, is asymmetric, and peaked at f = 0.25. 


The two largest components are still those surrounding the signal frequency, with the larger of the two being 
the one closer to the signal frequency. 
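

The three spectra can be reproduced with a short computation. This is a minimal sketch assuming NumPy and a zero-phase cosine (the phase of the plotted signals is an assumption); the helper name `doubled_magnitudes` is hypothetical.

```python
import numpy as np

n = 40
idx = np.arange(n)

def doubled_magnitudes(f):
    """Magnitudes of the positive-frequency DFT components (k = 0 ... n/2) of cos(2 pi f j),
    with the 1/n forward normalization, doubled for 0 < k < n/2 to account for the
    conjugate (negative-frequency) partners."""
    S = np.fft.rfft(np.cos(2 * np.pi * f * idx)) / n
    mags = 2.0 * np.abs(S)
    mags[0] /= 2.0        # k = 0 has no conjugate partner
    mags[-1] /= 2.0       # nor does k = n/2 (n is even here)
    return mags

on_basis = doubled_magnitudes(10.0 / n)      # f = 0.25
halfway = doubled_magnitudes(10.5 / n)       # f = 0.2625
fifth = doubled_magnitudes(10.2 / n)         # f = 0.255

for name, mags in (("0.25", on_basis), ("0.2625", halfway), ("0.255", fifth)):
    largest_two = np.argsort(mags)[-2:][::-1]
    print("f =", name, "largest components at k =", largest_two, "magnitudes", mags[largest_two])
```
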


These examples show that a DFT, with its fixed basis frequencies, can give only a rough estimate of an 
unknown sinusoid’s frequency. The estimate gets worse if the unknown signal is not exactly a sinusoid, 
because that means it has an even smaller spectral peak, with more components spread around the spectrum. 


Other methods exist for estimating the frequency of an unknown signal, even one that is non-uniformly 
sampled in time. If the signal is fairly sinusoidal, one can correlate with a sinusoidal basis frequency, and 
numerically search for the frequency with maximum correlation. This avoids the discrete-frequency 
limitation of a DFT. Other methods usually require many periods of data, e.g. epoch folding [Leahy, Ap J, 
1983, vol 266, p160??]. 
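

The correlation search can be sketched in a few lines. The code below assumes NumPy; the function name, the brute-force grid of trial frequencies, and the test signal are illustrative choices, and a numerical optimizer could replace the grid.

```python
import numpy as np

def estimate_frequency(t, s, trial_freqs):
    """Return the trial frequency whose complex sinusoid correlates most strongly with s(t).

    The sample times t need not be uniformly spaced.
    """
    t = np.asarray(t, dtype=float)
    s = np.asarray(s, dtype=float)
    # correlation[m] = | sum_j s_j exp(-i 2 pi f_m t_j) |
    correlation = np.abs(np.exp(-2j * np.pi * np.outer(trial_freqs, t)) @ s)
    return trial_freqs[np.argmax(correlation)]

n = 40
t = np.arange(n)                               # uniform here, but any sample times work
f_true = 10.5 / n                              # halfway between two DFT basis frequencies
s = np.cos(2 * np.pi * f_true * t)

trial_freqs = np.linspace(0.2, 0.3, 2001)      # far finer than the DFT spacing of 0.025
f_est = estimate_frequency(t, s, trial_freqs)
print(f_true, f_est)
```

For a real sinusoid over a short record, the estimate is close to, but not exactly at, the true frequency, because the negative-frequency component also contributes to the correlation.
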


Nonuniform Sampling and Arbitrary Basis Functions 


So far, we have used a signal sampled uniformly in time. We now show that one can find a Fourier 
transform of a signal with any set of n samples, uniform or not. This has many applications: some 
experiments (such as lunar laser ranging) cannot sample the signal uniformly for practical, economic, or 
political reasons. Magnetic Resonance Imaging (MRI) often uses non-uniform sampling to reduce imaging 
time, which can be an hour or more for a patient. 


We write the required transform as a set of simultaneous equations, with $t_j$ as the arbitrary sample times, 
and keeping (for now) the uniformly spaced frequencies: 




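$$
\begin{aligned}
s(t_0) &= \sum_{k=0}^{n-1} S_k \exp\big( i(2\pi k/n)\, t_0 \big) \\
s(t_1) &= \sum_{k=0}^{n-1} S_k \exp\big( i(2\pi k/n)\, t_1 \big) \\
&\ \ \vdots \\
s(t_{n-1}) &= \sum_{k=0}^{n-1} S_k \exp\big( i(2\pi k/n)\, t_{n-1} \big) .
\end{aligned}
$$

Or, with $f_k \equiv k/n$: 

$$
\begin{bmatrix} s(t_0) \\ s(t_1) \\ \vdots \\ s(t_{n-1}) \end{bmatrix}
=
\begin{bmatrix}
\exp(i2\pi f_0 t_0) & \exp(i2\pi f_1 t_0) & \cdots & \exp(i2\pi f_{n-1} t_0) \\
\exp(i2\pi f_0 t_1) & \exp(i2\pi f_1 t_1) & \cdots & \exp(i2\pi f_{n-1} t_1) \\
\vdots & \vdots & & \vdots \\
\exp(i2\pi f_0 t_{n-1}) & \exp(i2\pi f_1 t_{n-1}) & \cdots & \exp(i2\pi f_{n-1} t_{n-1})
\end{bmatrix}
\begin{bmatrix} S_0 \\ S_1 \\ \vdots \\ S_{n-1} \end{bmatrix} .
$$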


How can we find the required coefficients, $S_k$? 


The exponential functions are no longer orthogonal over the sample times; 
they are only orthogonal over uniformly spaced samples. 


Nonetheless, we have n unknowns ($S_0$, ... $S_{n-1}$), and n equations. So long as the basis functions are 
linearly independent over the sample times, we can (in principle) solve for the needed coefficients, $S_k$. We 
have now greatly expanded our ability to decompose arbitrary samples into basis functions: 


We can decompose a signal over any set of sample times into any set of linearly independent 
(not necessarily orthogonal) basis functions. 


Note that Parseval’s theorem does not apply to the coefficients, since the basis functions (evaluated at 
the non-uniform sample points) are no longer orthogonal. Also, $S_0$ is no longer the average of the signal 
values, since the sinusoids may have nonzero average over the sample points. 


There is one more subtlety: what is the fundamental frequency $f_0$? Equivalently, what is the signal 
period? The two are related, because $f_0$ = 1/period. There is no unique answer to this. However, since a 
finite signal transforms as if it is periodic, the period cannot be $(t_{n-1} - t_0)$, since the first and last samples 
would then have to be identical. The period must be longer than that. A convenient choice is to simply 
mimic what happens when the samples are uniform. In that case, 


$$ \text{period} = \frac{n}{n-1} \left( t_{n-1} - t_0 \right), \qquad f_0 = 1/\text{period} . $$


This choice for period reproduces the traditional DFT when the samples are uniform, and is usually adequate 
for non-uniform samples, as well. 
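

The whole procedure fits in a few lines. This is a minimal sketch, assuming NumPy; the function name `nonuniform_dft`, the jittered sample times, and the choice to measure time from the first sample are illustrative assumptions. It uses the period n/(n − 1) times the span of the sample times, and solves the simultaneous equations with a general linear solver.

```python
import numpy as np

def nonuniform_dft(t, s):
    """Solve s(t_j) = sum_k S_k exp(i 2 pi k (t_j - t_0) / period) for the n coefficients S_k.

    period = n/(n-1) * (t[-1] - t[0]). t must be sorted, with distinct values.
    Returns the coefficients and the basis matrix.
    """
    t = np.asarray(t, dtype=float)
    s = np.asarray(s, dtype=complex)
    n = len(t)
    period = n / (n - 1) * (t[-1] - t[0])
    k = np.arange(n)
    # basis[j, k] = k-th basis function evaluated at the j-th sample time
    basis = np.exp(2j * np.pi * np.outer(t - t[0], k) / period)
    return np.linalg.solve(basis, s), basis

rng = np.random.default_rng(6)
n = 8
t_uniform = np.arange(n, dtype=float)
t_jittered = t_uniform + rng.uniform(-0.2, 0.2, size=n)   # hypothetical non-uniform times
s = rng.normal(size=n)

S_uniform, _ = nonuniform_dft(t_uniform, s)               # reproduces the ordinary DFT
S_jittered, basis_jittered = nonuniform_dft(t_jittered, s)
print(np.allclose(S_uniform, np.fft.fft(s) / n))
print(np.allclose(basis_jittered @ S_jittered, s))
```

For strongly non-uniform times the matrix can become ill-conditioned, so the solved coefficients may be large and sensitive to noise even though they still reproduce the samples.
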


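Example: DFT of a real, non-uniformly sampled sequence: We can set up the matrix equation to be 
solved by recalling the frequency layout for even and odd n, and applying the above. We set $t_0$ = 0, and 
define n/2 as floor(n/2). For illustration of the last two columns, we take n odd: 


$$
\begin{aligned}
s(t_0) &= \sum_{k=0}^{n/2} \left[ A_k \cos(k\omega t_0) + B_k \sin(k\omega t_0) \right] \qquad \text{where } \omega \equiv 2\pi/n \\
s(t_1) &= \sum_{k=0}^{n/2} \left[ A_k \cos(k\omega t_1) + B_k \sin(k\omega t_1) \right] \\
&\ \ \vdots \\
s(t_{n-1}) &= \sum_{k=0}^{n/2} \left[ A_k \cos(k\omega t_{n-1}) + B_k \sin(k\omega t_{n-1}) \right] .
\end{aligned}
$$

Or, collecting the n unknown cosine and sine amplitudes into one column vector: 

$$
\begin{bmatrix} s(t_0) \\ s(t_1) \\ \vdots \\ s(t_{n-1}) \end{bmatrix}
=
\begin{bmatrix}
1.0 & 1.0 & 0.0 & \cdots & 1.0 & 0.0 \\
1.0 & \cos(\omega t_1) & \sin(\omega t_1) & \cdots & \cos((n/2)\omega t_1) & \sin((n/2)\omega t_1) \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
1.0 & \cos(\omega t_{n-1}) & \sin(\omega t_{n-1}) & \cdots & \cos((n/2)\omega t_{n-1}) & \sin((n/2)\omega t_{n-1})
\end{bmatrix}
\begin{bmatrix} A_0 \\ A_1 \\ B_1 \\ \vdots \\ A_{n/2} \\ B_{n/2} \end{bmatrix} .
$$


This gives us the sine and cosine components separately. For n even, the highest frequency component is k 
= n/2, or ω = 2πk/n = 2π(1/2) = π rad/sample, and the final column of sin(·) is not present. 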


Note that this is not a fit; it is an exact, reversible transformation. The matrix is the set of all the basis 
functions (across each row), evaluated at the sample points (down each column). The matrix has no 
summations in it, and depends on the sample points, but not on the sample values. 
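

A minimal sketch of this real-valued setup, assuming NumPy; the helper name `real_basis_matrix` and the jittered sample times are hypothetical. The unknowns are ordered as the constant term, then alternating cosine and sine amplitudes.

```python
import numpy as np

def real_basis_matrix(t, omega):
    """Rows: sample times. Columns: 1, cos(w t), sin(w t), cos(2 w t), sin(2 w t), ...

    For n even, the final sine column is omitted, leaving an n x n matrix.
    """
    t = np.asarray(t, dtype=float)
    n = len(t)
    columns = [np.ones(n)]
    for k in range(1, n // 2 + 1):
        columns.append(np.cos(k * omega * t))
        columns.append(np.sin(k * omega * t))
    return np.column_stack(columns[:n])

rng = np.random.default_rng(7)
n = 7                                                  # odd, as in the illustration
t = np.arange(n) + rng.uniform(-0.2, 0.2, size=n)      # hypothetical non-uniform times
t -= t[0]                                              # set t_0 = 0
omega = 2 * np.pi / n                                  # rad/sample
s = rng.normal(size=n)                                 # real sample values

M = real_basis_matrix(t, omega)                        # depends on sample times only
coeffs = np.linalg.solve(M, s)                         # constant, then cos/sin amplitudes
print(np.allclose(M @ coeffs, s))
```
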


Example: basis functions as powers of x: In the continuous world, a Taylor series is a decomposition 
of a function into powers of (x — a), which are a set of linearly independent (but not orthogonal) basis 
functions. Despite this lack of orthogonality, Taylor devised a clever way to evaluate the basis-function 
coefficients without solving simultaneous equations. 


Example: sampled standard basis functions: We could choose a standard (continuous) mathematical 
basis set, such as Bessel functions, $J_k(t)$. For n sample points, $t_0$, ... $t_{n-1}$, the Bessel functions are linearly 
independent, and we can solve for the coefficients, $A_k$. We need a scale factor α for the time (equivalent to 
2πk/n in the Fourier transform). For example, we might use α = the (n − 1)th zero of $J_{n-1}(t)$. Then: 


$$
\begin{aligned}
s(t_0) &= \sum_{k=0}^{n-1} A_k J_k\!\left( \alpha \frac{t_0}{t_{n-1}} \right) \\
&\ \ \vdots \\
s(t_{n-1}) &= \sum_{k=0}^{n-1} A_k J_k\!\left( \alpha \frac{t_{n-1}}{t_{n-1}} \right) .
\end{aligned}
$$


We have n equations and n unknowns, $A_0$, ... $A_{n-1}$, so we can solve for the $A_k$. 


Don’t Pad Your Data, Even for FFTs 


Old fashioned FFT implementations required you to have N = a power of 2 number of samples (64, 1024, 
etc.). Modern FFT implementations are general to any number of samples, and use the prime decomposition 
of N to compute the DFT quickly and accurately. The least favorable case is when N is prime (or has a large 
prime factor): a simple mixed-radix implementation then evaluates the DFT directly from the defining 
summations, although some libraries use other algorithms (e.g., Bluestein's or Rader's) that remain fast for 
prime N. Either way, with modern computers, this is usually so fast that we don’t care. 
In the old days, if people had a non-power-of-2 number of data points, they used to “pad” their data, 
typically (and horribly) by just adding zeros to the end until they reached a power of 2. This introduced 
artifacts into the spectrum, which often obscured or destroyed the very information they sought [Ham p??]. 


Don’t pad your data. It screws up the spectrum. 
With a modern FFT implementation, there is no need for it, anyway. 
If for some reason, you absolutely must constrain N to some preferred values, it is much better to throw 
away some data points than to add fake ones. 
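

A small experiment shows the kind of artifact padding introduces. This is a minimal sketch assuming NumPy; the signal, the threshold, and the helper name are illustrative. A cosine with exactly 10 cycles in 40 samples has a single positive-frequency component, but zero-padding it to 64 samples spreads it over many.

```python
import numpy as np

n = 40
idx = np.arange(n)
s = np.cos(2 * np.pi * (10 / n) * idx)          # exactly 10 cycles in 40 samples

def significant_components(x, threshold=1e-6):
    """Count positive-frequency DFT components with magnitude above a small threshold."""
    S = np.fft.rfft(x) / len(x)
    return int(np.count_nonzero(np.abs(S) > threshold))

padded = np.concatenate([s, np.zeros(64 - n)])   # zero-pad to the next power of 2

count_unpadded = significant_components(s)       # the single component at f = 0.25
count_padded = significant_components(padded)    # many components: padding artifacts
print(count_unpadded, count_padded)
```
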


Two Dimensional Fourier Transforms 


One dimensional Fourier transforms often have time or space as the independent variable. Two 
dimensional transforms almost always have space, say (x, y), as the independent variables. The most common 
2D transform is of pictures. 


In the continuous world of light, lenses can physically project a Fourier transform of an image based on optics, 
with no computations. This allows for filtering the image with opaque masks, and re-transforming back to the 
original-but-filtered image, all at the speed of light with no computer. But digitized images store the image as pixels, 
each with some light intensity. These are computationally processed by computer. 


Basis Functions 

TBS. Not sines and cosines, or products of sines and cosines. Products of complex exponentials. Wave 
fronts at various angles, discrete $k_x$ and $k_y$. 


Note on Continuous Fourier Series and Uniform Convergence 


The continuous Fourier Series is defined for a periodic signal s(t) over a continuous range of times, 
t ∈ [0, T): 


$$ s(t) = \sum_{k=0}^{\infty} S_k\, e^{ik\omega_0 t}, $$


where   $k\omega_0$ is the frequency of the kth component, 
        $S_k$ is the complex amplitude of the component. 

Note that the time interval is continuous, but the frequency components are discrete. In general, periodic 
signals lead to discrete frequency components. 


The continuous Fourier Series is not always uniformly convergent. 
Therefore, the order of integrations and summations cannot always be interchanged. 


Non-uniform convergence is illustrated by the famous Gibbs phenomenon, frequently observed in 
digital electronics: when we transform a square wave to the frequency domain (aka Fourier space), then retain 
only a finite number of frequency components, and then transform back to the time domain, the square wave 
comes back with overshoot: wiggles that are large near the discontinuities: 


[Figure: original square wave and reconstructed wave, with the overshoot marked.] 


Figure 11.9 Gibbs phenomenon: (Left) After losing high frequencies, the reconstructed square 
wave has overshoot and wiggles. (Right) Retaining more frequencies reduces wiggle time, but not 
amplitude. 
As we include more and more frequency components, the wiggles get narrower (damp faster), but do not 
get lower in amplitude. This means that there are always some time points for which the inverse transform 
does not converge to the original square wave. Such wiggles are commonly observed in many electronic 
systems, which must necessarily drop high frequency components above some cut-off frequency. 
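

The persistence of the overshoot is easy to see numerically. This is a minimal sketch assuming NumPy and a ±1 square wave of period 2π, whose Fourier series is (4/π) Σ sin(mt)/m over odd m; the grid and the numbers of retained terms are arbitrary choices.

```python
import numpy as np

def square_wave_partial_sum(t, n_terms):
    """Partial Fourier sum of a +/-1 square wave of period 2 pi:
    (4/pi) * sum of sin(m t)/m over the first n_terms odd harmonics m."""
    total = np.zeros_like(t)
    for m in range(1, 2 * n_terms, 2):
        total += np.sin(m * t) / m
    return 4.0 / np.pi * total

t = np.linspace(0.0, np.pi / 2, 20001)        # from the jump at t = 0 to mid-plateau

peaks = {}
peak_times = {}
for n_terms in (10, 100):
    wave = square_wave_partial_sum(t, n_terms)
    peaks[n_terms] = wave.max()
    peak_times[n_terms] = t[np.argmax(wave)]

print(peaks)       # both peaks overshoot the plateau value of 1 by a similar amount
print(peak_times)  # the peak moves closer to the jump as more terms are kept
```
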


However: 


Continuous signals have Fourier Series that converge uniformly. This applies to most physical 
phenomena, so interchanging integration and summation is valid [F&W p217+]. 


This is true even if the derivative of the signal is discontinuous. 


Fourier Transforms, Periodograms, and Lomb-Scargle 


In some circles, one hears the terms “Fourier Transform,” “periodogram,” and “Lomb-Scargle” a lot. 
Each of these is distinct, but they are related. Understanding the differences can help you analyze your data. 
We provide here an overview of some common signal processing algorithms. Be warned: 


Because spectral analysis can be tricky, its practice is rife with misunderstanding and mythology. 


Throughout the text, I will occasionally note common misunderstandings, but there are too many for me to 
correct them all. We address the Lomb-Scargle algorithm in particular, since it is widely misunderstood. 


Correspondingly, the terminology is also highly confused and abused. We define here some common 
terms in ways that are consistent with the majority of our (limited) experience in the literature. However, 
there appears to be little universal agreement on precise definitions, especially across different disciplines. 
(Words are the tools of communication; it is impossible to make fine points with dull tools.) In this work, 
we adhere to the following definitions: 


• Spectral analysis is the examination of periodic components of a data sequence. 


• The energy of a single data point is its squared magnitude, and is always ≥ 0 (the term “energy” 
derives from early applications where the squared magnitude was proportional to physical energy). 
The “energy” of a sample-set is the sum of the squares of the samples. The “energy” of a frequency 
component is the sum of the “energies” of that frequency taken over the sample times. 


• The power in a frequency component is its squared magnitude, and is often normalized in some 
specified way. In some references, the term “energy” is used for “power.” (As with “energy,” the 
“power” in a component might be unrelated to physical power.) In this work, we occasionally write 
“energy” and “power” in quotes, as a reminder that they are not physical energy and power. For 
non-uniform sampling, the “energy” of a frequency component is not proportional to its “power.” 


• The statistical significance is the false alarm probability, often called alpha, α. Experimenters 
usually choose α before analyzing the data. It is the probability that a pure noise signal will, by 
chance, suggest the presence of a signal. It is essentially the same as the p-value. Note that a lower 
significance means a result is more significant, i.e. more likely real, and less likely random. 
Nonetheless, authors often speak loosely of “higher” significance meaning “more significant” or a 
lower significance value. It is clearer to say “more significant” instead of “higher significance.” 


• A detection parameter is a statistic calculated for a trial signal that (roughly) tells how likely the 
trial is to be a real phenomenon, rather than a result of random chance. A higher significance means 
a result is less likely to be random. In Lomb-Scargle, the significance of a frequency tells how likely 
that frequency is to be a significant component of the signal. Note that “significance” is different 
than “power.” 


• A DFT (Discrete Fourier Transform) is a precisely defined decomposition of a sequence (uniformly 
spaced or not) into a set of sinusoidal components. (An “FFT” is just an efficient way to perform a 
DFT in some limited cases. We have limited use for “FFT” here.) DFTs use uniformly spaced 
frequencies, but are easily extended to non-uniformly spaced frequencies. 


• A periodogram is some kind of graph of periodic components of a sample set. There are many 
methods for producing a periodogram, which produce different results [2]. Therefore, a 
“periodogram” is less well-defined than a DFT. Usually, the frequencies in a periodogram are 
uniformly spaced, but the periodogram frequency spacing may be tighter than the DFT. 


• Lomb-Scargle is a formula for finding the significance of a given sinusoidal frequency in data. 


• A Lomb-Scargle periodogram is a graph of detection parameter vs. frequency, where each 
parameter is computed with the Lomb-Scargle algorithm. The “LS” in “LS periodogram” can stand for 
either “Lomb-Scargle” or “Least Squares”, since the Lomb-Scargle algorithm produces the 
detection parameter for a least squares sinusoidal fit. Note that the LS algorithm produces a 
detection parameter, not power, despite common belief to the contrary. 


Be careful to distinguish between uniformly spaced samples of data, 
and uniformly spaced frequencies in the periodogram. 


Caution: For orthogonal basis functions (as in a uniformly sampled DFT), the energy and power of 
every frequency are proportional, and therefore the terms are often interchanged. However, for non-uniform 
sampling, the “energy” of a frequency component is not proportional to its “power.” This is the crux of the 
confusion about the LS “periodogram.” The LS result is essentially the “energy” of a given sinusoidal 
frequency in the data, used to help find significant sinusoids in the data. 


The Discrete Fourier Transform vs. the Periodogram 


The single biggest distinction between a DFT and a Lomb-Scargle periodogram is that 


the DFT simultaneously optimizes all the components to form an exact transformation. 
A Lomb-Scargle periodogram examines each frequency by itself, regardless of other frequencies. 


The DFT is exact, and invertible, with no loss of information. At times, this can be a plus, but in many 
cases, this “exactness” results in anomalies. In particular, any set of physical measurements is only a subset 
of the exact representation of the physical phenomenon. In other words, a sample set is incomplete, and so 
the information contained in it is limited. In addition, all measurements contain some noise. If we put such 
a sample set into a DFT, it gives us frequency components which exactly match the given incomplete samples, 
noise and all. To achieve this exactness, the DFT must sometimes contort the spectrum in unphysical ways. 
In particular, highly non-uniformly sampled signals often result in large DFT artifacts. By definition, a DFT 
produces a spectrum of precisely defined, uniformly spaced frequencies. [However, one can easily compute
an exact decomposition onto an arbitrary set of frequencies, and furthermore, onto an arbitrary set of basis
functions that need not be sinusoidal.]


As scientists, we often would rather see something less mathematically exact, and more physically 
meaningful. We combine all our knowledge of the system (and science in general) with our limited, noisy 
data, to reach new conclusions. A periodogram provides a way to look at frequency content of a signal, 
without some of the unphysical anomalies of an exact DFT. Also, a periodogram can plot results at an 
arbitrary set of frequencies, not just those defined in a DFT. In fact, periodograms usually choose a larger, 
and more densely packed, set of frequencies than a DFT produces. However, periodograms suffer from 
anomalies, as well. 


In a DFT, the frequency components don’t “overlap,” in the sense that none of the “information” of one 
component appears in any other component. This is true even though the basis sinusoids are not orthogonal 
over the given sample times. There is no “extra” information in the DFT: e.g., a sample set of n = 40 points 
transforms into exactly 21 cosines and 19 sines, having exactly the same 40 degrees of freedom as the original 
data set. 
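The component count in this example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `dft_component_counts` is hypothetical (it does not come from the text), and it assumes a real-valued sequence, for which the zero-frequency term (and, for even n, the Nyquist term) has a cosine but no sine partner.

```python
def dft_component_counts(n):
    """Count the cosine and sine components of a real DFT of n samples.

    The cosines include the zero-frequency (DC) term and, for even n,
    the Nyquist term; neither of those has a sine partner.
    """
    if n < 1:
        raise ValueError("need at least one sample")
    n_cos = n // 2 + 1
    n_sin = n - n_cos          # total degrees of freedom must equal n
    return n_cos, n_sin


n_cos, n_sin = dft_component_counts(40)
print(n_cos, n_sin, n_cos + n_sin)   # 21 19 40
```

The two counts always add up to n, which is the statement that the DFT carries no “extra” information.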


In contrast, in a periodogram, the component powers are themselves correlated, and the information from 
one component also shows up in some of the other components (especially adjacent components). 
Furthermore, especially in small data sets [ref??], the non-orthogonality of the periodogram’s sinusoids may 
cause a single component of the data to produce spikes at multiple widely-spaced frequencies in the 
periodogram. This may mislead the user into believing there are multiple causes, one for each peak. Finally,
for any sample size n, we can make a periodogram with any number of frequencies, even far more than n.
This again shows that the periodogram contains redundant information.


Despite common belief, a Lomb-Scargle periodogram is not a periodogram of sinusoidal “power.” It is 
a graph of detection parameter vs. frequency, where each parameter is computed by a minimum least-squares 
residual fit of a single sinusoid at that frequency [3]. For large data sets, or well-randomized sample times, 
the parameter value approaches the power, so people often “get away with” confusing the two. However, 
for small data sets, or those where the sample times are clustered around a periodic event (say, nighttime), 
the significance of a frequency can be very different than its “power” estimate. Note that when the sample 
times are clustered around a frequency, say 1 cpd (cycle per day), it can affect many frequencies in the 
sample, especially near harmonics or sub-harmonics (e.g., 2 cpd, 3 cpd, 0.5 cpd, etc.). 


When fitting a sinusoid of given frequency to data, there are two fit parameters. These may be taken as 
cosine and sine amplitudes, or equivalently as magnitude and phase of a single sinusoid. The true “power” 
at that frequency (considered by itself) is the sum of the squares of the cosine and sine amplitudes, or 
equivalently, the square of the magnitude. 
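As a small numerical illustration of the equivalence between the two parameterizations, the sketch below converts cosine and sine amplitudes into a magnitude and phase and checks that both forms describe the same sinusoid. The function name `to_magnitude_phase` and the sample values are hypothetical, chosen only for this example.

```python
import math


def to_magnitude_phase(a, b):
    """Convert cosine/sine amplitudes (a, b) to (magnitude, phase).

    Uses a*cos(x) + b*sin(x) = m*cos(x - phi),
    with m = hypot(a, b) and phi = atan2(b, a).
    """
    return math.hypot(a, b), math.atan2(b, a)


a, b = 3.0, 4.0                      # illustrative amplitudes
m, phi = to_magnitude_phase(a, b)
power = a ** 2 + b ** 2              # sum of squared amplitudes

# Both forms give the same value at any argument x:
x = 0.7
two_amplitude_form = a * math.cos(x) + b * math.sin(x)
magnitude_phase_form = m * math.cos(x - phi)
print(m, power, two_amplitude_form, magnitude_phase_form)
```

Here the “power” A² + B² equals the squared magnitude m², as stated above.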


Practical Considerations 


Here are a few possible issues with spectral analysis. Again, it is a highly involved topic, and these 
issues are only a tiny introduction to it. 


Removing trends: Before using spectral analysis, it is common to remove simple trends in the data, 
such as a constant offset, or straight line trends [ref??]. A straight-line introduces a complicated spectral 
structure which often obscures the phenomena of interest. Thus, removing it before spectral analysis usually 
helps. A constant offset introduces spurious frequency detections, especially for bunched samples, as is
typical of astronomical data. Also, constant offsets may lead to worse round-off error. Furthermore, even
though you should never pad your data (see below), padding with zeros when your data has a non-zero 
average only compounds your error. 
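Trend removal of this kind can be sketched in plain Python. The following is a minimal example, not a prescribed method: it assumes an ordinary (unweighted) least-squares straight line, at least two distinct sample times, and the hypothetical function name `remove_linear_trend`; the sample data are invented for illustration.

```python
import math


def remove_linear_trend(times, values):
    """Subtract the least-squares straight line (offset + slope) from values.

    Assumes unweighted data and at least two distinct sample times.
    """
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    s_tt = sum((t - t_mean) ** 2 for t in times)
    s_tv = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = s_tv / s_tt
    return [v - v_mean - slope * (t - t_mean) for t, v in zip(times, values)]


# Non-uniform sample times; signal = offset + straight line + sinusoid.
times = [0.0, 0.4, 1.1, 1.9, 3.2, 4.0, 5.7]
values = [2.0 + 0.5 * t + math.sin(3.0 * t) for t in times]
residual = remove_linear_trend(times, values)
print(residual)
```

By construction, the residual has zero average and zero least-squares slope, so neither a constant offset nor a straight-line trend remains to contaminate the spectrum.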


Stepwise regression: Sometimes we have in our data a frequency component which is obscuring the 
phenomenon of interest. We may model (fit) that frequency, and subtract it from the data, in hopes of 
revealing the interesting data. Note that finding frequencies in our data, and subtracting them, one at a time, 
is simply the standard statistical method of stepwise multiple regression (not simultaneous multiple 
regression). We are “regressing” one frequency component at a time. Therefore, stepwise frequency 
subtraction has all the usual pitfalls of stepwise regression. In particular, the single biggest component may 
be completely subsumed by two (or more) smaller components. Therefore, when performing such stepwise 
frequency modeling, it may help to use the standard method of backward elimination to delete from the model 
any previously found component that is no longer useful in the presence of newer components. 


Computational burden: Many decomposition algorithms rely on some form of orthogonality, e.g., this
is the basis (wink) of Discrete Fourier Transforms. Orthogonality allows a basis decomposition to be done
by correlation (aka using inner-products). Recall that such a correlation decomposition, including Lomb-Scargle
periodograms, requires O(n²) operations. In contrast, a non-orthogonal decomposition, such as a
DFT over non-uniform sample times, solves simultaneous equations requiring O(n³) operations, so can be
much slower. For n = 1,000 samples, the non-orthogonal decomposition is about 1,000 times slower, and
requires billions of operations. This may be a noticeable burden, even on modern computers (perhaps
requiring many minutes).


Smoothed DFT: One surprisingly common approach to making a periodogram (not Lomb-Scargle) is 
to make a DFT, with its possible anomalies, and then try to disperse those anomalies by “smoothing” the 
resulting graph of power vs. frequency. I believe smoothing a DFT is like trying to invest wisely after you’ve 
lost all your money gambling. It’s too late: you can’t get back what’s already lost. Likely much better is to 
make some other kind of periodogram in the first place, and don’t use a DFT, or use it only as guidance for 
more appropriate analyses. In particular, with highly non-uniformly spaced samples, the DFT anomalies 
include large (but unphysical) amplitudes, which are not removed by smoothing. Furthermore, smoothing a
DFT of nonuniformly spaced samples requires O(n³) operations, so it not only likely produces poor results,
it does so slowly.


One possible advantage of the “smoothed DFT” approach is that for very large data sets (n > ~10,000),
if n is amenable to a Fast Fourier Transform and your samples are uniformly spaced, then the DFT can be
done in O(n log n) operations. A typical Lomb-Scargle periodogram requires O(n²) operations. However,
[Press and Rybicki 1989] provide a way to use FFT-like methods to create a Lomb-Scargle periodogram,
thus using O(n log n) operations. While still slower than a true FFT, this makes Lomb-Scargle periodograms
of millions of data points feasible.


Bad information: As mentioned earlier, many references (seemingly most references, especially on the 
web), are wrong in important (but sometimes subtle) ways. E.g., some references actually recommend 
padding your data (almost always a terrible idea, discussed elsewhere in Funky Mathematical Physics 
Concepts). Many references incorrectly describe the differences between uniform and non-uniform 
sampling, the meaning of FFT, aliasing, and countless other concepts. In particular, 


[...]


This is a common misconception, which is discussed earlier in the section “Sampling.”


The (Deprecated) Lomb-Scargle Algorithm 


We here describe the now-deprecated Lomb-Scargle (L-S) algorithm; the next section explains how it 
works. The algorithm is interesting for the concepts it uses, even though in practice, it is much better to use 
modern, uncertainty-weighted 3-parameter sinusoidal fits. 


The algorithm starts with n discrete measurements (samples), s_j, taken at times t_j, j = 0, ... n−1. The
algorithm first finds the time offset τ that makes the cosine and sine orthogonal over the given sample times:


$$\tau \text{ such that } \sum_{j=0}^{n-1} \cos\omega(t_j-\tau)\,\sin\omega(t_j-\tau) = 0. \qquad \tau \text{ satisfies } \tan(2\omega\tau) = \frac{\sum_{j=0}^{n-1}\sin 2\omega t_j}{\sum_{j=0}^{n-1}\cos 2\omega t_j}.$$


Note that τ depends on ω, so each ω has its own τ. Also, τ depends on the sample times, but not on the
measurements, s_j. You can use a principal valued (2-quadrant) arctangent to find τ.


Next, L-S subtracts out the average signal, giving samples:


$$h_j = s_j - \langle s_j \rangle, \quad \text{where } \langle s_j \rangle = \text{average of } s_j.$$


Then the Lomb-Scargle normalized periodogram detection statistic is, in inner product notation:


$$D(\omega) \equiv \frac{1}{2s^2}\left[\frac{\langle\cos|h\rangle^2}{\langle\cos|\cos\rangle} + \frac{\langle\sin|h\rangle^2}{\langle\sin|\sin\rangle}\right], \quad \text{where } s^2 \equiv \frac{1}{n-1}\sum_{j=0}^{n-1} h_j^2 = \text{sample variance}$$

[Pres&Ryb 1989 (3) p277].


We deliberately use the non-standard notation D(ω), rather than P(ω), to emphasize that the L-S parameter
is a detection statistic, not a power (despite widespread belief). In particular, D(ω) is dimensionless, and
independent of measurement scaling, so it can’t be a “power.” Expanded in more conventional notation, the
L-S normalized detection parameter at a given frequency ω is [Pres&Ryb 1989 (3)]:


$$D(\omega) \equiv \frac{1}{2s^2}\left\{\frac{\left[\sum_{j=0}^{n-1} h_j\cos\omega(t_j-\tau)\right]^2}{\sum_{j=0}^{n-1}\cos^2\omega(t_j-\tau)} + \frac{\left[\sum_{j=0}^{n-1} h_j\sin\omega(t_j-\tau)\right]^2}{\sum_{j=0}^{n-1}\sin^2\omega(t_j-\tau)}\right\}.$$


NB: The L-S algorithm demands equal uncertainties on the data, which is one reason it is deprecated. This
is exactly the equation for the coefficient of determination one gets from a standard statistical fit which
minimizes the residual signal in a least-squares sense (i.e., minimum residual energy) [4]. Such a fit is a
simultaneous 2-parameter linear fit (for A and B) to the model:


$$s_{fit}(t) = A\cos(\omega(t-\tau)) + B\sin(\omega(t-\tau)), \qquad P_{true}(\omega) = A^2 + B^2.$$


P_true(ω) is the true estimate of the “power” at ω, because it is proportional to the squared amplitude of
the fitted sinusoid at frequency ω. For large data sets, or well-randomized sample times, D(ω) approaches
being proportional to P_true(ω) at all frequencies. Therefore, the parameter D(ω) is often used as a substitute
for the spectral power estimate, P_true(ω). As with most hypothesis testing, the presence of a spectral line
(frequency) is deemed likely if the line’s parameter value would be very unlikely to arise from pure
noise. Since both terms in the L-S formula are squares of gaussian random variables (RVs), the Lomb-Scargle
expression in brackets for pure gaussian noise is proportional to a χ² distribution with ν = 2 degrees of
freedom. The factor of 1/2 makes the probability distribution of D(ω) approach a unit-mean exponential [3],
rather than a χ² with ν = 2. However, the
normalization by s² means that D(ω) is exactly beta distributed (not F distributed, as thought for decades) [A.
Schwarzenberg-Czerny, 1997].


Note that s² is (close to) the average “energy” (squared value) of the samples (remember that the average
value of all the samples has been subtracted off, so the h_j have zero average). The 1/s² in this equation makes
the result independent of the signal amplitude, i.e. multiplying all the data by a constant has no effect on the
periodogram. Also, for pure noise, D(ω) is roughly independent of the number of samples, n, since s² is
independent of n, and the numerators and denominators both scale roughly like n. The numerator summations
scale like √n, because they are sums of random variables (noise), so their squares scale like n.


In contrast, if a signal is present at frequency ω, D(ω) grows like n, because then the numerator
summation grows like n, so its square grows like n² while the denominator still grows like n. Thus, if a
signal is present, it becomes easier to detect with a larger sample set, consistent with our intuition.
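The formulas of this section can be collected into a short, dependency-free Python sketch. It is meant only as an illustration under stated assumptions: the function name `lomb_scargle_d` and the sample data are hypothetical, the data are assumed to have equal uncertainties (as L-S requires), and `math.atan2` is used in place of the 2-quadrant arctangent merely to avoid a division by zero (any τ satisfying the tangent condition gives the same D).

```python
import math


def lomb_scargle_d(times, samples, omega):
    """Normalized Lomb-Scargle detection statistic D(omega)."""
    n = len(samples)
    mean = sum(samples) / n
    h = [s - mean for s in samples]              # subtract the average
    var = sum(x * x for x in h) / (n - 1)        # sample variance s^2

    # Orthogonalizing time offset: tan(2 w tau) = sum(sin 2wt) / sum(cos 2wt)
    s2 = sum(math.sin(2 * omega * t) for t in times)
    c2 = sum(math.cos(2 * omega * t) for t in times)
    tau = math.atan2(s2, c2) / (2 * omega)

    cos_j = [math.cos(omega * (t - tau)) for t in times]
    sin_j = [math.sin(omega * (t - tau)) for t in times]
    hc = sum(x * c for x, c in zip(h, cos_j))    # <cos|h>
    hs = sum(x * s for x, s in zip(h, sin_j))    # <sin|h>
    cc = sum(c * c for c in cos_j)               # <cos|cos>
    ss = sum(s * s for s in sin_j)               # <sin|sin>
    return (hc * hc / cc + hs * hs / ss) / (2 * var)


# Hypothetical irregular sample times and a sinusoid plus a constant offset.
times = [0.0, 0.7, 1.1, 2.6, 3.0, 4.4, 5.9, 6.1, 7.8, 8.3, 9.6, 11.2]
samples = [2.0 + 1.5 * math.sin(1.3 * t) for t in times]

d_signal = lomb_scargle_d(times, samples, 1.3)
d_scaled = lomb_scargle_d(times, [3.0 * s for s in samples], 1.3)
d_offset = lomb_scargle_d(times, [s + 10.0 for s in samples], 1.3)
print(d_signal, d_scaled, d_offset)
```

The three printed values agree (up to round-off), illustrating that D(ω) is independent of both the measurement scale and any constant offset, which is why it cannot be a “power.”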


The Meaning Behind the Math 


Understanding exactly what Lomb-Scargle does, and how it works, puts you in a powerful position to 
understand why it is deprecated. Also, if you ever want to develop a novel algorithm, or wonder how others 
develop them, Lomb-Scargle provides an interesting and informative example of the process. (However, our
derivation here is very different from the original by Lomb, who first derived this result using the standard
“normal equations” for least-squares fitting [Lom].)


The Lomb-Scargle formula may look daunting, but we can understand and derive it in just a few high- 
level steps: 


1. Given our basis of cosine and sine, find a way to make them orthogonal. 

2. Use standard orthogonal decomposition of our data into our two basis functions. 

3. Normalize our coefficients, being careful to distinguish power-estimate from detection parameter. 
4. Prove that the correlation amplitude of the previous steps is equivalent to the least-squares fit. 


We complete these steps below, in full detail.


1. Make Cosine and Sine Orthogonal


When making an LS periodogram, we are not performing a basis decomposition. We are separately
finding correlations with each periodogram frequency, without regard to any other frequencies. For
real-valued data (i.e., not complex), there are two basis functions at any frequency: cosine and sine. We need
both to find the detection level (and also the “power”) at that frequency.


At any given frequency, ω, we have two basis functions, cos(ωt) and sin(ωt), which we write as a sum:
A cos(ωt) + B sin(ωt). Recall how a uniformly sampled DFT works: ω is a multiple of the fundamental
frequency, and our sample times are uniformly spaced. Then cosine and sine are naturally orthogonal:


$$\sum_{j=0}^{n-1}\cos(\omega j\Delta t)\,\sin(\omega j\Delta t) = 0, \quad \text{where } \begin{cases} \omega = \text{a multiple of the fundamental frequency} \\ j = \text{sample number} \\ \Delta t = \text{sampling period} = 1/F_{samp}. \end{cases}$$


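Using this orthogonality, we find our coefficients A and B separately, using inner products:


$$A = \langle s|\cos\rangle = \frac{2}{n}\sum_{j=0}^{n-1} s_j\cos(\omega j\Delta t), \qquad B = \langle s|\sin\rangle = \frac{2}{n}\sum_{j=0}^{n-1} s_j\sin(\omega j\Delta t).$$


The power at frequency ω is then A² + B².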


In contrast, for arbitrary sample times t_j (as in much observational data), or for arbitrary ω, cos(·) and
sin(·) are not orthogonal (i.e., they are “correlated”):


$$C \equiv \sum_{j=0}^{n-1}\cos(\omega t_j)\,\sin(\omega t_j) \ne 0, \quad \text{where } \begin{cases} \omega = \text{arbitrary frequency} \\ j = \text{sample number} \\ t_j = \text{arbitrary sample times}. \end{cases}$$


Because cosine and sine are correlated, we cannot use simple inner-products to find A and B separately.
Furthermore, the presence of other components prevents us from simply simultaneously solving for the
amplitudes A and B.


Despite being correlated, cosines and sines are usually still a convenient basis, because they are the 
eigenfunctions of linear, time-invariant systems, and appear frequently in physical systems. So we ask: Is 
there a way to “orthogonalize” the cosines and sines over the given set of arbitrary sample times? Yes, there
is, as we now show.


Consider the basis-function parameters we have to play with: amplitude, frequency, and phase. We are
given the frequency, and are seeking the amplitudes. The only parameter left to adjust is phase (or
equivalently, a shift in time). So we could write the correlation amplitude C above as a function of some
phase shift φ:


$$C(\phi) = \sum_{j=0}^{n-1}\cos(\omega t_j - \phi)\,\sin(\omega t_j - \phi).$$


Can we find a phase shift φ₀ such that C(φ₀) = 0, thus constructing a pair of orthogonal cosine and sine?
The simplest shift I can think of is π: cos(ωt_j + π) = −cos(ωt_j), and similarly for sin(·). Thus a phase shift of
π negates both cosine and sine, and the correlation is not affected: C(φ + π) = C(φ). The next simplest shift
is π/2. This converts cos(·) → sin(·), and sin(·) → −cos(·), so C(φ + π/2) = −C(φ). This is great: C(φ) is a
continuous function of φ, and it changes sign in every interval of π/2. This means that somewhere between
−π/4 and +π/4, C(φ) = 0, i.e. the cosine and sine are orthogonal.


The existence of a phase-shift φ₀ which makes cosine and sine orthogonal is important, because we can
always find the required φ₀ numerically. Even better, it turns out that we can find a closed-form expression
for φ₀. We notice that the correlation C(φ) can be rewritten, using a simple identity:


$$\sin 2\theta = 2\cos\theta\sin\theta \quad\Rightarrow\quad C(\phi_0) = 0 = \sum_{j=0}^{n-1}\tfrac{1}{2}\sin(2\omega t_j - 2\phi_0), \quad\text{or}\quad \sum_{j=0}^{n-1}\sin(2\omega t_j - 2\phi_0) = 0.$$


Given the sample times t_j, how do we find φ₀? We can use geometry: let’s set φ = 0 for now, and plot the
sum of the vectors corresponding to (x = cos(2ωt_j), y = sin(2ωt_j)), for some hypothetical sample times, t_j.
Each vector is unit length (see Figure 11.10).


Figure 11.10 Sum (blue) of n = 4 vectors corresponding to (x = cos(2ωt_j), y = sin(2ωt_j)). Axis labels:
Σ_j cos 2ωt_j (horizontal), Σ_j sin 2ωt_j (vertical); the angle of the sum is labeled 2φ₀.


(This is equivalent to plotting the complex numbers exp(i2ωt_j) in the complex plane.) If we rotate all the
vectors clockwise by 2φ₀, then the sum of the sine components will be zero, as required. The components of
the vector sum are the sums of the components, so:


$$\tan(2\phi_0) = \frac{\sum_{j=0}^{n-1}\sin 2\omega t_j}{\sum_{j=0}^{n-1}\cos 2\omega t_j} \quad\Rightarrow\quad C(\phi_0) = \sum_{j=0}^{n-1}\cos(\omega t_j - \phi_0)\,\sin(\omega t_j - \phi_0) = 0.$$


In other words, we rotate each component (in the 2ω set) by −2φ₀, which corresponds to rotating each
component of our original (1ω) set by −φ₀. This gives the condition we need for orthogonality. Since the
interval −π/4 to +π/4 must contain a φ₀, we can use a simple 2-quadrant arctangent, and divide the result by
2.


Any phase shift, at a given frequency, can be written as a time shift. By convention, Lomb-Scargle uses
a subtracted time shift, so:


$$2\omega\tau = 2\phi_0 \quad\Rightarrow\quad C(\phi_0) = \sum_{j=0}^{n-1}\cos(\omega(t_j-\tau))\,\sin(\omega(t_j-\tau)) = 0.$$


This time shift is fully equivalent to the φ₀ phase shift, and the cosines and sines are orthogonal over the given
sample times. Be careful to distinguish φ₀, the orthogonalizing phase shift, from a fitted-sinusoid phase,
usually called φ.
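The closed-form time shift can be checked numerically. The sketch below is illustrative only: the function names `orthogonalizing_tau` and `cross_correlation` and the sample times are hypothetical, and the 2-quadrant arctangent assumes the cosine sum is nonzero.

```python
import math


def orthogonalizing_tau(times, omega):
    """Time offset tau with tan(2*omega*tau) = sum(sin 2wt) / sum(cos 2wt)."""
    s2 = sum(math.sin(2 * omega * t) for t in times)
    c2 = sum(math.cos(2 * omega * t) for t in times)
    # 2-quadrant (principal-value) arctangent; assumes c2 != 0.
    return math.atan(s2 / c2) / (2 * omega)


def cross_correlation(times, omega, tau):
    """C = sum over samples of cos(w(t - tau)) * sin(w(t - tau))."""
    return sum(math.cos(omega * (t - tau)) * math.sin(omega * (t - tau))
               for t in times)


# Hypothetical, irregular sample times and an arbitrary frequency.
times = [0.0, 0.3, 1.1, 1.2, 2.9, 4.5, 4.6, 7.0]
omega = 2.0

tau = orthogonalizing_tau(times, omega)
c_unshifted = cross_correlation(times, omega, 0.0)   # generally nonzero
c_shifted = cross_correlation(times, omega, tau)     # zero up to round-off
print(tau, c_unshifted, c_shifted)
```

Because the principal-value arctangent lies between −π/2 and +π/2, the resulting phase shift ωτ lies between −π/4 and +π/4, as argued above.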


2. Use orthogonal decomposition of our data into our basis functions 


Now that we have orthogonal basis functions (though not yet normalized), we can find our cosine and
sine coefficients with simple correlations (aka inner-products):


$$A' = \langle h|\cos\rangle = \sum_{j=0}^{n-1} h_j\cos\omega(t_j-\tau), \qquad B' = \langle h|\sin\rangle = \sum_{j=0}^{n-1} h_j\sin\omega(t_j-\tau) \qquad \text{(unnormalized)},$$


4/27/2021 11:49AM — Copyright 2002-2021 Eric L. Michelsen. All rights reserved. 198 of 322 


elmichelsen.physics.ucsd.edu/ | Funky Mathematical Physics Concepts emichels at physics.ucsd.edu 


where the primes indicate unnormalized coefficients. Note that, because the offset cosine and sine functions 
are orthogonal, A’ and B’ fit for both components simultaneously. That is, orthogonality implies that the 
individual best fits for cosine and sine are also the simultaneous best fit. 


3. Normalize Our Coefficients 


With all orthonormal basis decompositions, we require normalized basis functions to get properly scaled
components. We normalize our coefficients by dividing them by the squared-norm of the (unnormalized)
basis functions:


$$A = \frac{A'}{\langle\cos|\cos\rangle} = \frac{\sum_{j=0}^{n-1} h_j\cos\omega(t_j-\tau)}{\sum_{j=0}^{n-1}\cos^2\omega(t_j-\tau)}, \qquad B = \frac{B'}{\langle\sin|\sin\rangle} = \frac{\sum_{j=0}^{n-1} h_j\sin\omega(t_j-\tau)}{\sum_{j=0}^{n-1}\sin^2\omega(t_j-\tau)} \qquad \text{(normalized)}.$$


These formulas are similar to the two terms in the Lomb-Scargle formula. The normalized coefficients, A
and B, yield a true power estimate for a best-fit sinusoid:


$$P_{true}(\omega) = A^2 + B^2 \quad \text{from} \quad s_{fit}(t_j) = A\cos\omega(t_j-\tau) + B\sin\omega(t_j-\tau).$$


To arrive at the Lomb-Scargle detection parameter, we must consider not the true power estimate,
P_true(ω), but the contribution to the total sample set “energy” (sum of squares) from our fitted sinusoid, s_fit,
at the given sample times. For example, for a frequency component with a given true power, if the sinusoid 
happens to be small at the sample times, then that component contributes a small amount to the sample-set 
“energy.” On the other hand, if the sinusoid happens to be large at the sample times, then that component 
contributes a large amount to the sample-set “energy.” 


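The significance of a frequency component at ω is a function of the ratio of the component’s energy to
that expected from pure noise. Given a component with cosine and sine amplitudes A and B, its energy in
the sample set is the sum of the squares of its samples, at the given sample times (the cross term vanishes by
orthogonality):


$$\begin{aligned} E(\omega) &= \sum_{j=0}^{n-1}\left[A\cos\omega(t_j-\tau) + B\sin\omega(t_j-\tau)\right]^2 \\ &= A^2\sum_{j=0}^{n-1}\cos^2\omega(t_j-\tau) + B^2\sum_{j=0}^{n-1}\sin^2\omega(t_j-\tau) \\ &= \frac{\left[\sum_{j=0}^{n-1} h_j\cos\omega(t_j-\tau)\right]^2}{\sum_{j=0}^{n-1}\cos^2\omega(t_j-\tau)} + \frac{\left[\sum_{j=0}^{n-1} h_j\sin\omega(t_j-\tau)\right]^2}{\sum_{j=0}^{n-1}\sin^2\omega(t_j-\tau)}. \end{aligned}$$


This is almost the Lomb-Scargle detection parameter.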


For white noise (not necessarily gaussian), with no signal, the samples h_j are independent identically
distributed (IID), with variance equal to the noise power, σ². Across many sets of noise, then, the numerator
sums above (before squaring) have variances:


$$\sigma_{A'}^2 = \sigma^2\sum_{j=0}^{n-1}\cos^2\omega(t_j-\tau), \qquad \sigma_{B'}^2 = \sigma^2\sum_{j=0}^{n-1}\sin^2\omega(t_j-\tau).$$


This means each term in E(ω) is the square of a zero-mean quantity with variance σ², so each term has
expected value σ².


For gaussian noise, A and B are gaussian, and E(ω) is the sum of their squares, scaled to the estimated
variance ≈ σ². Therefore, E(ω)/σ² (always < 1) is distributed with CDF an incomplete beta function [A.
Schwarzenberg-Czerny, 1997]. (For decades, it was thought that E(ω)/σ² was χ² distributed with ν = 2, but it is
easy to show that it is not: χ² has no upper bound, but E(ω)/σ² < 1.) Nonetheless, assuming the incorrect χ²
(ν = 2) distribution, Lomb-Scargle divides by 2 to get a more-convenient exponential distribution with μ = 1:


$$D(\omega) \equiv \frac{E(\omega)}{2\sigma^2}.$$


This is the standard (though flawed??) formula for LS detection. Note again that we don’t know the true σ²;
we must estimate it from the samples.


4. Prove that the correlation amplitude of the previous steps is equivalent to the least-squares fit 


This is a general theorem: any correlation amplitude for a component of a sequence s_j is equivalent to a
least-squares fit. We prove it by contradiction. Given any single basis function, b_k, we can construct a
complete, orthonormal basis set which includes it. In that case, the component of b_k is found by correlation,
as usual. Call it A_k.


The least-squares residual is simply the “energy” (sum of squares) of the sequence after subtracting off
the b_k component. Since the basis set is orthonormal, Parseval’s theorem holds. Thus, the residual energy
after subtracting the b_k component from s_j is the sum of the squares of all the other component amplitudes.
If there existed some other value of A_k which had less residual energy, then that would imply a different
decomposition into the other basis functions. But the decomposition into a given orthogonal basis is unique.
Therefore, no A_k other than the one given by correlation can have a smaller residual.


This proof holds equally well for discrete sequences s_j, and for continuous functions s(t).
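The equivalence can also be checked numerically for the two-function Lomb-Scargle basis. The sketch below (hypothetical function names and invented data) fits the same samples two ways: by separate correlations in the orthogonalized basis, and by solving the 2×2 least-squares normal equations in the unshifted, correlated basis cos(ωt), sin(ωt). Both bases span the same two-dimensional space, so the fitted values at the sample times should agree.

```python
import math


def fit_by_correlation(times, h, omega):
    """Fitted values from separate correlations in the orthogonalized basis."""
    s2 = sum(math.sin(2 * omega * t) for t in times)
    c2 = sum(math.cos(2 * omega * t) for t in times)
    tau = math.atan2(s2, c2) / (2 * omega)       # orthogonalizing offset
    cos_j = [math.cos(omega * (t - tau)) for t in times]
    sin_j = [math.sin(omega * (t - tau)) for t in times]
    a = sum(x * c for x, c in zip(h, cos_j)) / sum(c * c for c in cos_j)
    b = sum(x * s for x, s in zip(h, sin_j)) / sum(s * s for s in sin_j)
    return [a * c + b * s for c, s in zip(cos_j, sin_j)]


def fit_by_normal_equations(times, h, omega):
    """Fitted values from a simultaneous least-squares fit (no time shift)."""
    cos_j = [math.cos(omega * t) for t in times]
    sin_j = [math.sin(omega * t) for t in times]
    cc = sum(c * c for c in cos_j)
    ss = sum(s * s for s in sin_j)
    cs = sum(c * s for c, s in zip(cos_j, sin_j))    # nonzero cross term
    hc = sum(x * c for x, c in zip(h, cos_j))
    hs = sum(x * s for x, s in zip(h, sin_j))
    det = cc * ss - cs * cs
    a = (hc * ss - hs * cs) / det                    # Cramer's rule
    b = (hs * cc - hc * cs) / det
    return [a * c + b * s for c, s in zip(cos_j, sin_j)]


# Invented data at irregular sample times.
times = [0.0, 0.7, 1.1, 2.6, 3.0, 4.4, 5.9, 6.1]
h = [0.3, -1.2, 0.8, 1.9, -0.4, -1.1, 0.6, -0.9]
omega = 1.7

fit_corr = fit_by_correlation(times, h, omega)
fit_lsq = fit_by_normal_equations(times, h, omega)
max_diff = max(abs(p - q) for p, q in zip(fit_corr, fit_lsq))
print(max_diff)   # round-off level
```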


Bandwidth Correction (aka Bandwidth Penalty) 


Determining the significance of a signal detection requires some care, since most algorithms search for 
any one of many possible signals. For example, when searching for periodic signals in noisy data, one often 
searches many trial frequencies, and a “hit” on any frequency counts as a detection. How do we determine 
the significance (p value) of such a detection? p is also called the “false alarm” probability. 


All the common periodic-signal detection algorithms require bandwidth correction, because if one makes
enough attempts, even an unlikely outcome will eventually happen. If one tries many frequencies, the
probability that one of them exceeds a threshold is much higher than the probability of a single given
frequency exceeding that threshold. From elementary statistics, if the parameters for all frequencies are
independent, the probability that none of them is a false alarm (FA) is the product of the probabilities that each
one is not a false alarm. For M independent parameters at various frequencies, each with single-frequency
p-value p_1f, then in our gaussian white noise case (i.e., the standard Null Hypothesis of no signal):


$$\Pr(\text{all not FA}) = \Pr(\text{one not FA})^M = (1 - p_{1f})^M,$$

$$\text{confidence level} = 1 - p = \Pr(\text{all not FA}) = (1 - p_{1f})^M.$$


Therefore, to achieve an overall p-value for all frequencies of p, we must choose the p-value for each trial
frequency such that [Schwarzenberg-Czerny (1997)]:


$$1 - p = (1 - p_{1f})^M \quad\Rightarrow\quad p_{1f} = 1 - (1 - p)^{1/M}.$$


Since p is usually small (<~ .05), we can often use the binomial theorem to approximate:


$$(1 - p)^{1/M} \approx 1 - p/M, \qquad p_{1f} \approx p/M.$$


For simulations, we may want to estimate M from p and p_1f. For example, we choose p_1f, measure p, and
from that estimate M. Solving the above for M:


$$\ln(1 - p) = M\ln(1 - p_{1f}), \qquad M = \frac{\ln(1 - p)}{\ln(1 - p_{1f})}.$$

Larger M is more demanding on your data. 
Being conservative on a claim of detection means favoring larger M. 
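The bandwidth-correction formulas above are easy to evaluate directly. The following sketch uses hypothetical function names (`per_frequency_p`, `estimate_m`) and illustrative numbers (p = 0.05, M = 100); it assumes the M parameters are independent, as in the derivation.

```python
import math


def per_frequency_p(p_overall, m):
    """p-value required at each of m independent frequencies: 1 - (1 - p)^(1/M)."""
    return 1.0 - (1.0 - p_overall) ** (1.0 / m)


def estimate_m(p_overall, p_single):
    """Number of independent frequencies implied by p and the per-frequency p-value."""
    return math.log(1.0 - p_overall) / math.log(1.0 - p_single)


p, m = 0.05, 100                     # illustrative values
p1 = per_frequency_p(p, m)           # exact per-frequency p-value
p1_approx = p / m                    # small-p approximation
m_back = estimate_m(p, p1)           # recovers m, up to round-off
print(p1, p1_approx, m_back)
```

The exact per-frequency p-value is slightly larger than the approximation p/M, so using p/M is (mildly) conservative.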


In most period-searching methods (except for the DFT), we are free to search as many frequencies as we
like, at as dense a trial frequency spacing as we like. We call our significance parameter θ = θ(f), because it
is a function of frequency. Intuitively, we expect that two very close frequencies will produce similar θ
values, and indeed, such θ-values are correlated (in the precise statistical sense). So our problem reduces to
determining M, the number of independent frequencies in our arbitrary set of frequencies.


The bottom line is that, for dense trial frequencies, M is approximately the same as if we had equally 
spaced samples, and therefore a simple DFT [Press 1988]. Such a DFT has independent frequency 
components. This simple-sounding result, however, requires understanding a few subtleties, especially when 
the trial frequencies are sparse. 


We consider 3 cases, starting with the simplest:
• Uniformly spaced data points, uniformly spaced frequencies (i.e., a DFT).
• Arbitrarily spaced data points, uniformly spaced frequencies.
• Arbitrarily spaced data points, arbitrarily spaced frequencies.


Notation:
θ      significance parameter, such as Lomb-Scargle, Phase Dispersion Minimization, etc.
Δf     the independent frequency spacing.
N      number of data samples.
M      number of independent θ values over our chosen set of frequencies.
BW     the total range of frequencies tried: BW = f_max − f_min.
T      the total duration of samples: T = t_max − t_min.


Equally spaced samples: If our samples are equally spaced, we have the common case of a Discrete
Fourier Transform (DFT). For white (i.e., uncorrelated) noise, each frequency component is independent of
all the others. Furthermore, in the relevant notation (where all frequencies are positive), the maximum
number of frequencies, and their spacing, is:


$$\text{max \# DFT frequencies} = N/2, \qquad \Delta f = 1/T.$$


Note that the maximum number of frequencies depends only on the number of data points, N, and not on T.
However, we may be looking for frequencies only in some range (Figure 11.11).


Figure 11.11 Sample frequency spectrum s(f) for uniformly spaced discrete time data with N = 10 (here Δf = 0.1).
BW defines a subset of frequencies (here BW = 0.325).


Therefore, for dense trial frequencies (Δf_trial < Δf), the number of independent frequencies is approximately:


$$M \approx BW/\Delta f = (BW)\,T = 0.325/0.1 = 3.25 \to 4.$$


We round M up to be conservative. 
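As a quick check of this arithmetic, the sketch below computes M from the bandwidth and the sample duration, rounding up. The function name `independent_frequencies` is hypothetical; the numbers are those of the example above (Δf = 0.1, so T = 10, and BW = 0.325).

```python
import math


def independent_frequencies(bandwidth, duration):
    """Approximate number of independent frequencies, M ~ BW * T, rounded up."""
    return math.ceil(bandwidth * duration)


delta_f = 0.1                  # independent frequency spacing
duration = 1.0 / delta_f       # T = 1 / delta_f
m = independent_frequencies(0.325, duration)
print(m)   # 4
```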


Arbitrarily spaced samples: In astronomy, the data times are rarely uniformly spaced. In such cases,
we usually choose our trial frequency spacing, Δf_trial, to be dense, i.e., smaller than the independent frequency
spacing: Δf_trial < Δf ≈ 1/T. Then, per [Press 1988], we use the same equations as above to approximate Δf
and M:


$$\Delta f \approx 1/T, \qquad M \approx BW/\Delta f = (BW)\,T, \qquad (\Delta f_{trial} < \Delta f). \qquad (11.4)$$


Note that this is true even if BW > Nyquist frequency, which is perfectly valid for nonuniformly spaced time
samples [Press 1988].


In the unusual case that our trial frequency spacing is large, Δf_trial > Δf, then we approximate that each
frequency is independent:


$$M \approx \text{\# trial frequencies}, \qquad (\Delta f_{trial} > \Delta f). \qquad (11.5)$$


(In reality, even if θ values separated by Δf are truly independent, some θ values separated by more than Δf
will be somewhat correlated. However, the correlation coefficient “envelope” mostly decreases with
increasing frequency spacing. Nonetheless, these correlations imply that there are parasitic cases where this
approximation, eq. (11.5), fails.)


Arbitrarily spaced trial frequencies: One common situation leading to nonuniformly spaced trial
frequencies is that of uniformly spaced trial periods. If the ratio of highest to lowest period is large (say, > 2),
then the frequency spacing is seriously nonuniform.


We may think of Δf as approximately the difference in frequency required to make the θ values
independent. (In reality, even if θ values separated by Δf are truly independent, some θ values separated by
more than Δf will be somewhat correlated. However, the correlation coefficient “envelope” decreases with
increasing frequency spacing.) In such a case, we may break up the trial frequencies into (1) regions where
Δf_trial < Δf, and (2) regions where Δf_trial > Δf (Figure 11.12).


Figure 11.12 Nonuniformly spaced trial frequencies (N = 10), showing a region where Δf_trial < Δf and a
region where Δf_trial > Δf.


The region where Δf_trial < Δf behaves as before, as does the region where Δf_trial > Δf. In the example of
Figure 11.12, we have:


$$\Delta f = 0.1, \qquad M \approx [(0.5 - 0.1)/0.1] + 2 = 6.$$
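One way to mechanize this counting is sketched below. It is only an illustrative heuristic, not a method given in the text: the function name `count_independent` is hypothetical, it treats each gap between consecutive trial frequencies separately (dense gaps contribute gap/Δf, as in eq. (11.4); sparse gaps contribute one independent frequency each, as in eq. (11.5)), and the trial frequencies are invented to mimic the example (a dense region from 0.1 to 0.5, then two sparse frequencies at assumed positions).

```python
def count_independent(trial_freqs, delta_f):
    """Estimate M for sorted trial frequencies (illustrative heuristic)."""
    m = 0.0
    for lo, hi in zip(trial_freqs, trial_freqs[1:]):
        gap = hi - lo
        if gap < delta_f:
            m += gap / delta_f     # dense region: bandwidth / delta_f
        else:
            m += 1.0               # sparse region: each frequency counts once
    return m


dense = [0.1 + 0.025 * k for k in range(17)]   # 0.1 ... 0.5, spacing 0.025
sparse = [0.65, 0.85]                          # assumed positions, spacing > 0.1
m = count_independent(dense + sparse, 0.1)
print(round(m, 6))   # 6.0
```

With these assumed frequencies the dense region contributes (0.5 − 0.1)/0.1 = 4 and the sparse region contributes 2, reproducing M ≈ 6.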


Summary 


Bandwidth correction requires estimating the number of independent frequencies. For uniformly spaced,
dense trial frequencies (and arbitrarily spaced time samples), we approximate the number of independent
frequencies, M, with eq. (11.4). We may think loosely of Δf as the difference in frequency required for θ to
become independent of its predecessor. Therefore, for nonuniformly spaced trial frequencies, we must
consider two types of region: (1) where the trial frequency spacing Δf_trial < Δf, we use eq. (11.4); (2) where
the trial frequency spacing Δf_trial > Δf, we approximate M as the number of trial frequencies, eq. (11.5).


References 


[Pres&Ryb 1989] Press, William H. and George B. Rybicki, Fast Algorithm for Spectral Analysis of
Unevenly Sampled Data, Astrophysical Journal, 338:277-280, 1989 March 1.


[2] http://www.ltrr.arizona.edu/~dmeko/notes_6.pdf , retrieved 1/22/2012. 

[3] Press, William H. and Saul A. Teukolsky, Search Algorithm for Weak Periodic Signals in 
Unevenly Spaced Data, Computers in Physics, Nov/Dec 1988, p77. 

[4] Scargle, Jeffrey D., Studies in Astronomical Time Series Analysis. II. Statistical Aspects of
Spectral Analysis of Unevenly Spaced Data, Astrophysical Journal, 263:835-853,
12/15/1982.

[Lom] Lomb, N. R., Least Squares Frequency Analysis of Unequally Spaced Data, Astrophysics
and Space Science 39 (1976) 447-462.

[Schw 1997] Schwarzenberg-Czerny, A., The Correct Probability Distribution for the Phase Dispersion
Minimization Periodogram, The Astrophysical Journal, 489:941-945, 1997 November 10.


Analytic Signals and Hilbert Transforms


Given some real-valued signal, s(t), it is often convenient to write it in “phasor form.” Such uses arise
in diverse signal processing applications from communication systems to neuroscience. We describe here
the meaning of “analytic signals,” and some practical computation considerations. This section relies heavily
on phasor concepts, which you can learn from Funky Electromagnetic Concepts. We proceed along these
lines:


• Mathematical definitions and review.

• The meaning of the analytic signal, A(t).

• Instantaneous values.

• Finding A(t) from the signal s(t), theoretically.

• The special case of zero reference frequency, ω₀ = 0; Hilbert Transform.

• A simple and reliable numerical computation of A(t) without Hilbert Transforms.


Definitions, conventions, and review: There are many different conventions in the literature for
normalization and sign of the Fourier Transform (FT). We define the Fourier Transform such that our basis
functions are e^{iωt}, and our original (possibly complex) signal z(t) is composed from them; this fully defines
all the normalization and sign conventions:


For z(t) complex:

$$z(t) = \int_{-\infty}^{\infty} Z(\omega)\, e^{i\omega t}\, d\omega \quad\Rightarrow\quad Z(\omega) = \frac{1}{2\pi} \int_{-\infty}^{\infty} z(t)\, e^{-i\omega t}\, dt ,$$


where Z(ω) ≡ F{z(t)} is the Fourier Transform of z(t).

Note that we can think of the FT as a phasor-valued function of frequency, and that we use the positive time
dependence e^{+iωt}.


For real-valued signals we use s(t) instead of z(t). For real s(t), the FT is conjugate symmetric:

$$S(-\omega) = S^*(\omega) \qquad \text{for } s(t) \text{ real} .$$


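This conjugate symmetry for real signals allows us to use a 1-sided FT, where we consider only positive
frequencies, so that:


$$s(t) = 2\,\mathrm{Re}\left\{ \int_{0}^{\infty} S(\omega)\, e^{i\omega t}\, d\omega \right\}, \quad \text{which is equivalent to} \quad s(t) = \int_{-\infty}^{\infty} S(\omega)\, e^{i\omega t}\, d\omega, \qquad s(t) \text{ real} .$$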


Note that a complex signal with no negative frequencies is very different from a real signal which we 
choose to write as a 1-sided FT. We rely on this distinction in the following discussion. 


Analytic signals: Given a real-valued signal, s(t), its phasor form is:

$$s(t) = \mathrm{Re}\left\{ A(t)\, e^{i\omega_0 t} \right\} = |A(t)| \cos\left( \omega_0 t + \phi(t) \right) \qquad (11.6)$$

where A(t) ≡ |A(t)| e^{iφ(t)} is a (complex) phasor function of time, and
ω₀ ≡ somewhat arbitrary reference frequency.


Recall that as a phasor, A(t) is complex. The phasor form of s(t) may be convenient when s(t) is
band-limited (exists only in a well-defined range of frequencies, Figure 11.13 left), or where we are only concerned
with the components of s(t) in some well-defined range of frequencies. Figure 11.13 shows two 1-sided
Fourier Transform (FT) examples of S(ω), the FT of a hypothetical (real) signal s(t).


[Figure: two plots of |S(ω)| vs. ω; the left plot marks ω₀.]

Figure 11.13 Example 1-sided FTs of a real signal s(t): (Left) band-limited. (Right) Wideband.
The ω axis points only to the right, because we need consider only positive frequencies for a 1-sided
FT.


In communication systems, ω₀ is the carrier frequency. Note that even in the band-limited case, ω₀ may
be different than the band center frequency. [For example, in vestigial sideband modulation (VSB), ω₀ is
close to one edge of the signal band.] Keep in mind throughout this discussion that ω₀ is often chosen to be
zero, i.e. the spectrum of s(t) is kept “in place”.


We start by considering the band-limited case, because it is somewhat simpler. From Figure 11.13 (left),
we see that our signal s(t) is not too different from a pure sinusoid at a reference frequency ω₀, near the
middle of the band. s(t) and cos(ω₀t) might look like Figure 11.14, left. s(t) is a modulation of the pure
sinusoid, varying somewhat (i.e. perturbing) the amplitude and phase at each instant in time. We define these
variations as the complex function A(t). When a signal s(t) is real, and A(t) satisfies eq. (11.6), A(t) is called
the analytic signal for s(t).

[Figure: three plots vs. t from 0 to 5: s(t), |A(t)|, and φ(t).]


Figure 11.14 (Left) s(t) (dotted), and the reference sinusoid. (Middle) Magnitude of the analytic
signal |A(t)|. (Right) Phase of the analytic signal.


We can visualize A(t) by considering Figure 11.14, left. At t = 0, s(t) is a little bigger than 1, but it is
in-phase with the reference cosine; this is reflected in the amplitude |A(0)| being slightly greater than 1, and the
phase φ(0) = 0. At t = 1, the amplitude remains > 1, and φ is still 0. At t = 2, the amplitude has dropped to
1, and the phase φ(2) is now positive (early, or leading). This continues through t = 3. At t = 4, the amplitude
drops further to |A(4)| < 1, and the phase is now negative (late, or lagging), i.e. φ(4) < 0. At t = 5, the
amplitude remains < 1, while the phase has returned to zero: φ(5) = 0. Figure 11.14, middle and right, are
plots of these amplitudes and phases.


Instantaneous values: When a signal has a clear oscillatory behavior, one can meaningfully define
instantaneous values of phase, frequency, and amplitude. Note that the frequency of a sinusoid (in rad/s) is
the rate of change of the phase (in rad) with time. A general signal s(t) has a varying phase φ(t), also known
as an instantaneous phase. Therefore, we can define an instantaneous frequency, as well:


$$\text{phase} = \omega_0 t + \phi(t) \quad\Rightarrow\quad \omega(t) \equiv \frac{d}{dt}(\text{phase}) = \omega_0 + \frac{d\phi}{dt} .$$


Such an instantaneous frequency is more meaningful when |A(t)| is fairly constant over one or more periods.
For example, in FM radio (frequency modulation), |A(t)| is constant for all time, and all of the information is
encoded in the instantaneous frequency.


Finally, we similarly define the instantaneous amplitude of a signal s(t) as |A(t)|. This is more
meaningful when |A(t)| is fairly constant over one or more cycles of oscillation. The instantaneous amplitude
is the “envelope” which bounds the oscillations of s(t) (Figure 11.14, middle). By construction, |s(t)| ≤ |A(t)|
everywhere.
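
To make these definitions concrete, here is a minimal Python sketch that starts from a known phasor function A(t), builds s(t) from eq. (11.6), and evaluates the instantaneous amplitude and frequency. The carrier frequency, sample interval, and the particular modulation in `phasor` are hypothetical values chosen for illustration only.

```python
import cmath
import math

omega0 = 2 * math.pi * 5.0     # reference frequency, rad/s (hypothetical)
dt = 1e-3                      # sample interval, s (hypothetical)
times = [j * dt for j in range(2000)]


def phasor(t):
    """Hypothetical slowly varying A(t) = |A(t)| exp(i phi(t))."""
    amplitude = 1.0 + 0.2 * math.cos(2 * math.pi * 0.5 * t)
    phi = 0.3 * math.sin(2 * math.pi * 0.5 * t)
    return amplitude * cmath.exp(1j * phi)


A = [phasor(t) for t in times]
# Eq. (11.6): s(t) = Re{ A(t) exp(i omega0 t) }
s = [(a * cmath.exp(1j * omega0 * t)).real for a, t in zip(A, times)]

inst_amplitude = [abs(a) for a in A]      # envelope |A(t)|
phi = [cmath.phase(a) for a in A]         # phi(t); stays well inside (-pi, pi) here
# Instantaneous frequency omega0 + d(phi)/dt, by centered finite differences.
inst_freq = [omega0 + (phi[j + 1] - phi[j - 1]) / (2 * dt)
             for j in range(1, len(phi) - 1)]

# The envelope bounds the oscillations of s(t).
envelope_ok = all(abs(sj) <= aj + 1e-12 for sj, aj in zip(s, inst_amplitude))
print(envelope_ok, min(inst_freq), max(inst_freq))
```

In this example the instantaneous frequency wanders around ω₀ by less than 1 rad/s, because the phase modulation is slow and small.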


Finding A(t) from s(t): Given an arbitrary s(t), how do we find its (complex) analytic signal, A(t)?
First, we see that the defining eq. (11.6), s(t) = Re{A(t) e^{iω₀t}}, is underdetermined, since A(t) has two real
components, but is constrained by only the one equation. Therefore, if A(t) is to be unique, we must further
constrain it.


As a simple starting point, suppose s(t) is a pure cosine (we will generalize shortly). Then:

$$s(t) = \cos \omega_0 t = \mathrm{Re}\left\{ 1 \cdot e^{i\omega_0 t} \right\} \qquad \text{where } A(t) = 1 .$$

If instead, s(t) has a phase offset θ, then:

$$s(t) = \cos(\omega_0 t + \theta) = \mathrm{Re}\left\{ e^{i\theta} e^{i\omega_0 t} \right\} \qquad \text{where } A(t) = e^{i\theta} = \cos\theta + i \sin\theta .$$


(Note that θ = 0 reproduces the pure-cosine example above.) Thus, in the case of a pure sinusoid, A = A(t)
is the (constant) phasor for the sinusoid s(t), and the imaginary part of A is the same sinusoid delayed by ¼
cycle (90°, or π/2 radians):


$$\mathrm{Re}\{A\} = \cos\theta, \qquad \mathrm{Im}\{A\} = \sin\theta = \cos(\theta - \pi/2) .$$


In Fourier space, the real and imaginary parts of A are simply related. Recall that delaying a sinusoid by ¼
cycle multiplies its Fourier component by −i (for ω > 0). Therefore:


4/27/2021 11:49AM — Copyright 2002-2021 Eric L. Michelsen. All rights reserved. 205 of 322 


elmichelsen.physics.ucsd.edu/ | Funky Mathematical Physics Concepts emichels at physics.ucsd.edu 


$$\mathcal{F}\{ \mathrm{Im}(A(t)) \} = -i\, \mathcal{F}\{ \mathrm{Re}(A(t)) \}, \qquad \text{1-sided FT, } \omega > 0 .$$


We now generalize our pure sinusoid example to an arbitrary signal, which can be thought of as a linear
combination of sinusoids. The above relation is linear, so it holds for any linear combination of sinusoids, i.e.
it holds for any real signal s(t). This means that, by construction, the imaginary part of A(t) has exactly the
same magnitude spectrum as the real part of A(t). Also, the imaginary part has a phase spectrum which is
everywhere ¼ cycle delayed from the phase spectrum of the real part. This is the relationship that uniquely
defines the analytic signal A(t) that corresponds to a given real signal s(t) and a given reference frequency
ω₀. From this relation, we can solve for Im{A(t)} explicitly as a functional of Re{A(t)}:


$$\mathrm{Im}\{A(t)\} = \mathcal{F}^{-1}\left\{ -i\, \mathcal{F}\{ \mathrm{Re}(A(t)) \} \right\}, \qquad \text{1-sided FT, } \omega > 0 . \qquad (11.7)$$


This relation defines the Hilbert Transform (HT) from Re{A(t)} to Im{A(t)}.


The Hilbert Transform of s(t) is a function H(t) that has all the Fourier components of s(t),
but delayed in phase by ¼ cycle (90°).


Note that the Hilbert transform takes a function of time into another function of time (in contrast to the Fourier 
Transform that takes a function of time into a function of frequency). Since the FT is linear, eq. (11.7) shows 
that the HT is also linear. The Hilbert Transform can be shown to be given by the time-domain form: 


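$$\mathcal{H}\{s(t)\} \equiv H(t) = \frac{1}{\pi}\, \mathrm{PV}\! \int_{-\infty}^{\infty} \frac{s(t')}{t - t'}\, dt' \qquad \text{where PV} \equiv \text{Principal Value} .$$

(The integrand blows up at t′ = t, which is why we need the Principal Value to make the integral
well-defined.) We now easily show that the inverse Hilbert transform is the negative of the forward transform:

$$\mathcal{H}^{-1}\{H(t)\} = s(t) = -\frac{1}{\pi}\, \mathrm{PV}\! \int_{-\infty}^{\infty} \frac{H(t')}{t - t'}\, dt' \qquad \text{where PV} \equiv \text{Principal Value} .$$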


We see this because the Hilbert Transform shifts the phase of every sinusoid by 90°. Therefore, two Hilbert
transforms shift the phase by 180°, equivalent to negating every sinusoid, which is equivalent to negating
the original signal. Putting in a minus sign then restores the original signal.


Equivalently, the HT multiplies each Fourier component (ω > 0) by −i. Then H{H(t)} multiplies each
component by (−i)² = −1. Thus, H{H[s(t)]} = −s(t), and therefore H⁻¹ = −H.
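
The relation H⁻¹ = −H can be checked numerically. The sketch below is one possible discrete realization, not necessarily the one any particular software package uses: it multiplies the positive-frequency DFT bins by −i and the negative-frequency bins by +i, so that the output stays real. The naive O(n²) DFT helpers and the test signal are assumptions made to keep the example self-contained; the check also assumes the signal has no DC or Nyquist component.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) DFT, adequate for a short illustration."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n))
            for k in range(n)]


def idft(X):
    """Inverse of dft() above."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * j / n) for k in range(n)) / n
            for j in range(n)]


def hilbert_transform(s):
    """Delay every Fourier component of the real sequence s by 1/4 cycle."""
    n = len(s)
    S = dft(s)
    H = [0j] * n
    for k in range(1, n):
        if 2 * k < n:          # positive frequencies: multiply by -i
            H[k] = -1j * S[k]
        elif 2 * k > n:        # negative frequencies: multiply by +i
            H[k] = 1j * S[k]
        # DC (k = 0) and Nyquist (k = n/2) bins are left at zero
    return [h.real for h in idft(H)]


n = 64
s = [math.cos(2 * math.pi * 3 * j / n) + 0.5 * math.sin(2 * math.pi * 7 * j / n + 0.4)
     for j in range(n)]
h = hilbert_transform(s)
hh = hilbert_transform(h)

# Two Hilbert transforms negate the signal: H{H{s}} = -s.
max_residual = max(abs(a + b) for a, b in zip(hh, s))
print(max_residual)
```

The residual should be at the level of floating-point roundoff. As a further check, the transform of a pure cosine is the corresponding sine, i.e. the cosine delayed by ¼ cycle.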


Analytic signal relative to ω₀ = 0: Some analysts prefer not to remove a reference frequency e^{iω₀t}
from the signal, and instead include all of the phase in A(t); this is equivalent to choosing ω₀ = 0:


$$s(t) = \mathrm{Re}\{A(t)\} = |A(t)| \cos(\phi(t)) .$$

Since s(t) = Re{A(t)} is given, we can now find Im{A(t)} explicitly from (11.7):

$$\mathrm{Im}\{A(t)\} = \mathcal{F}^{-1}\left\{ -i\, \mathcal{F}\{s(t)\} \right\} = \mathcal{H}\{s(t)\}, \qquad \text{1-sided FT, } \omega > 0 .$$


In other words:


For ω₀ = 0, A(t) is just the complex phasor factors for s(t), without taking any real part.


If s(t) is dominated by a single frequency ω, then φ(t) contains a fairly steady phase ramp that is close to
φ(t) ≈ ωt (Figure 11.15). We can use the phase function φ(t) to estimate the frequency ω by simply taking
the average phase rate over our sample interval:


$$\omega_{est} = \frac{\phi(t_{end}) - \phi(0)}{t_{end}} .$$

Figure 11.15 Phase ramp of a perturbed sinusoid, and the estimate of ω.


Efficient numerical computation of A(t): The traditional way to find A(t) is to use a discrete Hilbert
Transform to evaluate the defining integral. (This is a standard function in scientific software packages.)
The discrete Hilbert Transform (DHT) is often implemented by taking a DFT, manipulating it, and then
taking the inverse DFT back to the time domain. This can be seen by recasting our above (1-sided DFT) description of
the Hilbert Transform (HT) into a 2-sided DFT form.
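
A minimal sketch of the DFT-based route, for ω₀ = 0, is shown below. It keeps the DC bin, doubles the positive-frequency bins, and zeroes the negative-frequency bins of the 2-sided DFT, which is a common recipe but is stated here as an assumption rather than as the method of any specific package. The naive DFT helpers, the `unwrap` helper, and the test signal are hypothetical. As a usage example, the resulting phase ramp is used to estimate the dominant frequency from the average phase rate.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) DFT, adequate for a short illustration."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n))
            for k in range(n)]


def idft(X):
    """Inverse of dft() above."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * j / n) for k in range(n)) / n
            for j in range(n)]


def analytic_signal(s):
    """Analytic signal for omega_0 = 0 from the 2-sided DFT of real s."""
    n = len(s)
    S = dft(s)
    for k in range(1, n):
        if 2 * k < n:
            S[k] *= 2.0        # positive frequencies: double
        elif 2 * k > n:
            S[k] = 0.0         # negative frequencies: remove
    return idft(S)


def unwrap(phases):
    """Remove 2*pi jumps so the phase ramp is continuous."""
    out = [phases[0]]
    for p in phases[1:]:
        step = p - out[-1]
        step -= 2.0 * math.pi * round(step / (2.0 * math.pi))
        out.append(out[-1] + step)
    return out


# Hypothetical test signal: a 10 Hz sinusoid with a small phase perturbation.
dt = 0.01
n = 200
omega_true = 2.0 * math.pi * 10.0
t = [j * dt for j in range(n)]
s = [math.cos(omega_true * tj + 0.2 * math.sin(2.0 * math.pi * tj)) for tj in t]

A = analytic_signal(s)
phi = unwrap([cmath.phase(a) for a in A])
omega_est = (phi[-1] - phi[0]) / (t[-1] - t[0])   # average phase rate
print(omega_true, omega_est)
```

The real part of `A` reproduces `s`, and the estimated frequency agrees with the true one to well under 1% for this mildly perturbed sinusoid.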


Recall that in the 1-sided DFT for a real signal s(t), the frequencies are always positive, ω > 0, and S(ω)
is just a phasor-valued function of frequency. To recover the real signal from phasors, we must take a
real part, Re{ }. In the 2-sided DFT, we instead arrive at the real part by adding the complex conjugate of all the
phasor factors:


$$s(t) = 2 \int_{0}^{\infty} d\omega\, \mathrm{Re}\left\{ S(\omega)\, e^{i\omega t} \right\} \quad\Rightarrow\quad s(t) = \int_{0}^{\infty} d\omega \left[ S(\omega)\, e^{i\omega t} + S^*(\omega)\, e^{-i\omega t} \right] .$$


However, to achieve a 2-sided FT, we rewrite the second term as a negative frequency. Then the integral
spans both positive and negative frequencies:


$$s(t) = \int_{-\infty}^{\infty} d\omega\, S(\omega)\, e^{i\omega t}, \qquad \text{where } S(-\omega) \equiv S^*(\omega) .$$


For a complex signal, z(t), only a 2-sided FT exists (a 1-sided FT is not generally possible). Then there 
is no symmetry or relation between positive and negative frequencies. 


We now describe a simple, efficient, and stable, purely time-domain algorithm for finding A(t) from a
band-limited s(t). This algorithm is sometimes more efficient than the DFT-based approach. It is especially
useful when the data must be downsampled (converted to a lower sampling rate by keeping only every nth
sample, called decimating). Even though s(t) is real, the algorithm uses complex arithmetic throughout.


[Figure: three plots of |S(ω)| vs. ω. The left plot marks the passband, centered on ω_mid.]


Figure 11.16 (Left) 1-sided FT of s(t), and (middle) its 2-sided equivalent. (Right) 2-sided FT of 
A(t). 


Figure 11.16 shows a 1-sided FT for a real s(t), along with its 2-sided FT equivalent, and the 2-sided FT
of the desired complex A(t). We define ω_mid as the midpoint of the signal band (this is not ω₀, which we take
to be zero for illustration). The question is: how do we efficiently go from the middle diagram to the right
diagram? In other words, how do we keep just the positive frequency half of the 2-sided spectrum? Figure
11.17 illustrates the simple steps to achieve this:


• Rotate the spectrum down by ω_mid (downconvert).
• Low-pass filter around the downconverted signal band.
• (Optional) Decimate (downsample).
• Rotate the spectrum back up by ω_mid (upconvert).


This results in a complex function of time whose 2-sided spectrum has only positive frequencies; in other
words, it is exactly the analytic signal A(t).


[Figure: spectra along the ω axis showing step 1 (downconvert by ω_mid), step 2 (low-pass filter), and step 4 (upconvert).]

Figure 11.17 (Left) To find A(t): 1. downconvert; 2. low-pass filter; (Right) 4. upconvert.


Step 1: Downconvert: Numerically, we downconvert in the time domain by multiplying by
exp(−iω_mid t). This simply subtracts ω_mid from the frequency of each component in the spectrum:


$$\text{For every } \omega: \qquad S(\omega)\, e^{i\omega t}\, e^{-i\omega_{mid} t} = S(\omega)\, e^{i(\omega - \omega_{mid}) t} .$$

Note that both positive and negative frequencies are shifted to the left (more negative) in frequency. In the
time domain, we construct the complex downconverted signal for each sample time t_j as:


$$z_{down}(t_j) = s(t_j) \exp(-i\omega_{mid} t_j) = s(t_j) \left[ \cos(\omega_{mid} t_j) - i \sin(\omega_{mid} t_j) \right] .$$


Step 2: Low-pass filter: Low-pass filters are easily implemented as Finite Impulse Response (FIR)
filters, with symmetric filter coefficients [Ham chap. 6, 7]. In the time domain:


$$A_{down}(t_j) = 2 \sum_{k=-m}^{m} c_k\, z_{down}(t_{j+k}) \qquad \text{where } 2m+1 \equiv \text{the number of filter coefficients}, \quad c_k \equiv \text{filter coefficients} .$$

The leading factor of 2 is to restore the full amplitude to A(t) after filtering out half the frequency components.


Step 3: (Optional) Decimate: We now have a (complex) low-pass signal whose full (2-sided) bandwidth
is just that of our desired signal band. If desired, we can now downsample (decimate), by simply keeping
every nth sample. In other words, our low-pass filter acts as both a pass-band filter for the desired signal, and
an anti-aliasing filter for downsampling. Two for the price of one.


Step 4: Upconvert: We now restore our complex analytic signal to a reference frequency of ω₀ = 0 by
putting the spectrum back where it came from. The key distinction is that after upconverting, there will be
no components of negative frequency because we filtered them out in Step 2. This provides our desired
complex analytic signal:


$$A(t_j) = A_{down}(t_j) \exp(i\omega_{mid} t_j) .$$


Note that the multiplications above are full complex multiplies, because both A_down and the exponential factor
are complex. Also, if we want some nonzero ω₀, we would simply upconvert by (ω_mid − ω₀) instead of
upconverting by ω_mid.
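
The four steps can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a definitive implementation: the Hamming-windowed sinc design in `lowpass_coefficients` is just one common way to obtain symmetric FIR coefficients, and the function names, sample rate, filter length, cutoff, and test signal are all hypothetical.

```python
import cmath
import math


def lowpass_coefficients(m, cutoff_hz, fs_hz):
    """Symmetric FIR low-pass with 2m+1 taps and unit gain at DC.
    Assumption: a Hamming-windowed sinc design."""
    coeffs = []
    for k in range(-m, m + 1):
        x = 2.0 * cutoff_hz / fs_hz * k
        sinc = 1.0 if k == 0 else math.sin(math.pi * x) / (math.pi * x)
        window = 0.54 + 0.46 * math.cos(math.pi * k / m)
        coeffs.append(sinc * window)
    total = sum(coeffs)
    return [c / total for c in coeffs]


def analytic_signal_time_domain(s, dt, omega_mid, coeffs, keep_every=1):
    """Downconvert, low-pass filter, optionally decimate, and upconvert.
    Returns (times, A) for the samples where the full filter fits."""
    m = (len(coeffs) - 1) // 2
    # Step 1: downconvert by omega_mid.
    z_down = [sj * cmath.exp(-1j * omega_mid * j * dt) for j, sj in enumerate(s)]
    times, A = [], []
    # Steps 2 and 3: filter, evaluating only the samples we keep.
    for j in range(m, len(s) - m, keep_every):
        a_down = 2.0 * sum(c * z_down[j + k]
                           for k, c in zip(range(-m, m + 1), coeffs))
        # Step 4: upconvert by omega_mid (i.e. omega_0 = 0).
        times.append(j * dt)
        A.append(a_down * cmath.exp(1j * omega_mid * j * dt))
    return times, A


fs = 1000.0                          # sample rate, Hz (hypothetical)
dt = 1.0 / fs
omega_mid = 2.0 * math.pi * 100.0    # band center, rad/s (hypothetical)


def true_phasor(t):
    """Known analytic signal used to build and check the test data."""
    amplitude = 1.0 + 0.3 * math.cos(2.0 * math.pi * 2.0 * t)
    phase = omega_mid * t + 0.5 * math.sin(2.0 * math.pi * 3.0 * t)
    return amplitude * cmath.exp(1j * phase)


s = [true_phasor(j * dt).real for j in range(1000)]
coeffs = lowpass_coefficients(m=50, cutoff_hz=50.0, fs_hz=fs)
times, A = analytic_signal_time_domain(s, dt, omega_mid, coeffs)
max_error = max(abs(a - true_phasor(t)) for t, a in zip(times, A))
print(max_error)
```

Here the residual error is set by the filter's passband ripple and stopband leakage, so a longer or better filter reduces it. Passing `keep_every=4` evaluates the filter only at every fourth sample, which is where the time-domain approach saves work when decimating.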


Summary 
The analytic signal for a real function s(t) is A(t), and is the complex phasor-form of s(t) such that:

$$s(t) = \mathrm{Re}\left\{ A(t) \exp(i\omega_0 t) \right\} \qquad \text{where } \omega_0 \equiv \text{reference frequency} .$$

ω₀ is often chosen to be zero, so that s(t) = Re{A(t)}. This definition does not uniquely define A(t), since A(t)
has real and imaginary components, but is constrained by only one equation. The Hilbert Transform of a real
function s(t) is H(t), and comprises all the Fourier components of s(t) phase-delayed by π/2 radians (90°).
We uniquely define A(t) by saying that its imaginary part is the Hilbert Transform of its real part. This gives
the imaginary part the exact same magnitude spectrum as the real part, but a shifted phase spectrum.


Analytic signals allow defining instantaneous values of frequency, phase, and amplitude for almost- 
sinusoidal signals. Instantaneous values are useful in many applications, including communication and 
neuron behavior. 


[Ham] Hamming, R. W., Digital Filters, Dover Publications (July 10, 1997), ISBN-13: 978-0486650883.


12 Period Finding


A common experimental task is to examine a data sequence for a periodic component, or several of them. 
For example, in astronomy, periodic signals in a star’s brightness point to physics in the star, or to planets 
orbiting the star. In model construction, periodicities in a model’s residual point to improvements to the 
model. 


Often, such a “periodicity” is obscured by noise and other components that are either aperiodic, or of 
vastly different period than the one(s) of interest. Usually, the experimenter does not know the frequency of 
the sought-after component, but has an idea of some range of possible frequencies, which may span several 
orders of magnitude. The experimenter's job is then at least two-fold:


• Determine a statistically valid likelihood that a periodic component exists in the data; and
• If it exists, estimate the frequency of the component.


Other goals may include estimating the shape and the amplitude of such a component (or components). In 
this chapter, we consider some algorithms for making such determinations. We focus here on a single 
periodic component, with a few suggestions for multiple components at the end. 


We discuss and compare periodogram algorithms: Discrete Fourier Transform (DFT), Lomb-Scargle 
(deprecated, but unfortunately still popular), Phase Dispersion Minimization (PDM), Epoch Folding 
(ToBeSupplied), what I call DCS (DC, cosine, sine), and briefly note why Discrete Fourier Transforms are 
often ineffective with non-uniformly sampled data (such as most astronomical data). 


The literature is rife with conflicting definitions, and even misuse of terms. Precise definitions are 
essential to understanding the data, which point to physical mechanisms underlying them. Therefore, we 
strictly follow these definitions: 


alias an artifact of uniform sampling. An alias is specifically not a sideband, though it can be 
considered a limiting case of a correlate. 


amplitude is the usual real amplitude of a basis function: for a sinusoid, it is simply the real amplitude 
of the sinusoid. 


correlate a sinusoidal frequency component that is correlated with another given sinusoidal 
frequency component. The spectral window function describes how different sinusoidal 
components are correlated for a given set of sample times (more later). The appearance of 
a true sinusoid at other periodogram frequencies is called spectral leakage [Scargle 1982 
p837]. 


detection parameter a statistic from the data that can be used to estimate the significance (or confidence 
level) of a potential detection of a signal. 


DFT Discrete Fourier Transform: a unique and invertible transform from the (typically) time- 
domain to the frequency domain: a DFT gives the frequency components comprising a 
signal. 

FFT Fast Fourier Transform: an algorithm for computing some DFTs. Not all DFTs benefit 
from an FFT. 

periodogram a set of detection statistics as a function of frequency. There are several ways to make a
periodogram, with some choice in which detection statistic to use. Because of these
choices, and unlike a DFT, a periodogram is neither unique nor invertible.


power or energy is proportional to (amplitude)², with normalization varying throughout the literature.


probability of false alarm (PFA) the probability that pure noise will create an artifact looking like a signal.


sideband a frequency component resulting from modulating (multiplying) a carrier sinusoid by a
modulating function of time, such as a slowly changing envelope function.


spectral window function For sinusoidal detections, a function W(Δω) that approximates the correlation of
an amplitude at frequency ω to the amplitude at frequency ω + Δω. W() depends only on
the sample times, and not the measured values.


Sinusoids, Fourier, Nyquist, and All That 


Many people are familiar with the basics of uniform sampling, DFTs, the Nyquist limit, and frequency 
resolution. This is described in more detail elsewhere in this work (Funky Mathematical Physics Concepts). 
However, most period finding is done on non-uniformly sampled data, is not based on a DFT, and the 
limitations of frequency resolution and Nyquist limit do not apply. 


For example, with a DFT, the fundamental frequency is f₁ = 1/T, and all frequencies in the transform are
multiples of that. Thus we say the “frequency resolution” is 1/T. But with a periodogram, the frequency
spacing is arbitrary, and the resolution is limited only by the signal-to-noise ratio (SNR). Frequencies are
often determined with uncertainty << 1/T.


Also, with uniform sampling, there is a Nyquist limit on the maximum frequency that can be detected:
f_max = f_samp/2. Higher frequencies are “aliased” down to lower frequencies by sampling, and the higher
frequencies are irretrievably lost. With non-uniform sampling, there is no sharp cutoff on the highest
frequency, and f_max is on the order of the inverse of the minimum sample-time interval [Press and Rybicki
1989 p277].
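
A short numerical illustration of aliasing, with hypothetical sample rate and frequencies: with uniform sampling, a sinusoid above the Nyquist limit produces exactly the same samples as its lower-frequency alias, whereas even modest jitter in the sample times makes the two distinguishable.

```python
import math
import random

f_samp = 10.0            # uniform sampling rate (hypothetical)
f_low = 3.0              # below the Nyquist limit f_samp / 2
f_high = f_samp - f_low  # above the Nyquist limit; aliases down to f_low


def max_difference(times):
    """Largest difference between the two sinusoids at the given times."""
    return max(abs(math.cos(2 * math.pi * f_low * t)
                   - math.cos(2 * math.pi * f_high * t)) for t in times)


uniform_times = [n / f_samp for n in range(50)]
rng = random.Random(0)
# Nonuniform times: each sample delayed by a random fraction of the interval.
nonuniform_times = [t + rng.uniform(0.0, 1.0 / f_samp) for t in uniform_times]

print(max_difference(uniform_times))      # essentially zero (roundoff only)
print(max_difference(nonuniform_times))   # clearly nonzero
```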


Overview of Period Finding Algorithms 


All the period-finding algorithms here rely on the statistical method of choosing a probability of false
alarm α (say, 5%), and finding a critical value of the detection parameter such that pure noise will exceed
the critical value only a fraction α of the time. Therefore, if a peak exceeds the critical value, we are confident
at the (1 − α) level that the peak is a signal, and not a fluke of pure noise. Finding the critical value is
complicated by the presence of many trial frequencies, and correlations between those trials. The presence
of background signals, other than pure noise, complicates the detection probabilities. Ultimately, shuffle
simulations can overcome these complications to provide a valid critical detection value for almost any
background signal.
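
The following Python sketch shows one plausible reading of a shuffle simulation; the details are assumptions, since this section does not spell them out. The measured values are randomly reassigned to the sample times, which destroys any periodicity while keeping the sample times and the distribution of values. The periodogram peak is recorded for each shuffle, and the critical value is the (1 − α) quantile of those peaks. The detection parameter `sinusoid_statistic` is a simple stand-in chosen for brevity, and the data and trial frequencies are hypothetical.

```python
import math
import random


def sinusoid_statistic(times, values, freq):
    """Stand-in detection parameter (an assumption for this sketch): roughly
    the fraction of the data variance captured by a sinusoid at freq.
    It is dimensionless and invariant to constant scale factors of the data."""
    n = len(values)
    mean = sum(values) / n
    c = sum((y - mean) * math.cos(2 * math.pi * freq * t) for t, y in zip(times, values))
    s = sum((y - mean) * math.sin(2 * math.pi * freq * t) for t, y in zip(times, values))
    total = sum((y - mean) ** 2 for y in values)
    return 2.0 * (c * c + s * s) / (n * total)


def shuffle_critical_value(times, values, trial_freqs, statistic,
                           alpha, n_shuffles, rng):
    """(1 - alpha) quantile of the periodogram peak over shuffled data."""
    shuffled = list(values)
    peaks = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)      # breaks periodicity, keeps the sample times
        peaks.append(max(statistic(times, shuffled, f) for f in trial_freqs))
    peaks.sort()
    index = min(n_shuffles - 1, math.ceil((1.0 - alpha) * n_shuffles) - 1)
    return peaks[index]


# Hypothetical data: nonuniform sample times, a sinusoid at f = 0.1, plus noise.
rng = random.Random(1)
times = sorted(rng.uniform(0.0, 100.0) for _ in range(80))
values = [math.sin(2 * math.pi * 0.1 * t) + rng.gauss(0.0, 0.5) for t in times]
trial_freqs = [0.02 + 0.005 * k for k in range(57)]

critical = shuffle_critical_value(times, values, trial_freqs, sinusoid_statistic,
                                  alpha=0.05, n_shuffles=100, rng=rng)
peak = max(sinusoid_statistic(times, values, f) for f in trial_freqs)
print(peak, critical, peak > critical)
```

Because the peak is taken over the whole periodogram in every shuffle, the resulting critical value already accounts for the many (correlated) trial frequencies. More shuffles give a more precise quantile at proportionally higher cost.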


It is crucial to distinguish a period-finding detection parameter from a component “amplitude” or 
“power.” A detection parameter is a statistic from the data that can be used to estimate the significance (or 
confidence level) of potential detection of a signal. All detection parameters of all algorithms are necessarily 
dimensionless and invariant to constant scale factors of the data. In contrast, a detected signal amplitude is 
the usual (real) amplitude of a basis function: for a sinusoid, it is simply the amplitude of the sinusoid. There 
is much confusion in the literature on this distinction, and many references use the term “power” to mean 
“detection parameter.” 


We can make a periodogram graph of the detection parameter vs. frequency, or of amplitude vs. 
frequency. 


References 


[Leahy 1983a] D. A. Leahy, et al., “On Searches for Pulsed Emission with Application to Four Globular
Cluster X-Ray Sources: NGC 1851, 6441, 6624, and 6712,” Astrophysical Journal, 
266:160-170, 1983 March 1. 


[Scargle 1982] Jeffrey D. Scargle, “Studies in Astronomical Time Series Analysis. II. Statistical Aspects 
of Spectral Analysis of Unevenly Spaced Data,” Astrophysical Journal, 263:835-853, 1982 
December 15. 


Phase Dispersion Minimization 


Some periodic signals are far from sinusoidal. For example, a planet transiting across a star makes a 
small, nearly rectangular dip in the light curve (intensity vs. time) at regular intervals (Figure 12.1a). The
units are arbitrary. A more realistic light curve would have more noise, obscuring the periodicity (Figure
12.1b). Zooming in on the data does not help much (Figure 12.1c). Phase Dispersion Minimization is an
algorithm to identify arbitrary wave shape periodic functions (not just sinusoids), even when such 
periodicities are corrupted by noise and other signals. PDM creates a periodogram, a graph of a detection 
parameter vs. frequency. Peaks in the periodogram indicate periodic components of the signal (Figure 12.1d). 
The peak at f = .01 clearly indicates the nearly-invisible periodic component (the time units are arbitrary). 
(The subharmonic at f= .005, and the harmonics at f= .02, .03, and .04 are also clearly visible. We discuss 
sub/harmonics later.) 


By its nature (as with epoch folding), PDM also picks up subharmonics of a periodic signal, though 
subharmonics weaken with their order. PDM is invariant to constant scale factors in either time or intensity, 
and also to DC (constant) offsets in time or intensity. 


A straightforward extension of the basic PDM allows for uncertainties to be included in measurements. 
(To our knowledge, including uncertainties is not addressed in the literature.) 


For illustration, we take the independent variable to be time, and the dependent variable to be intensity,
but of course, any variables can be used. The measurements are triples (t_i, y_i, u_i), where t_i are the
independent variables, y_i are the measured quantities, and u_i are the uncertainties for each measurement. The
sample times are arbitrary, and often nonuniform; however, they should be dense enough to have several
measurements in each period of the sought-after signal.


Figure 12.1 (a) Simulated nonsinusoidal periodic light curve. (b) Same as (a) but with more noise.
(c) Zoom in on data. (d) PDM periodogram of (b) or (c).


PDM Algorithm 


A PDM periodogram computes a detection parameter for each trial frequency. The trial frequencies are 
arbitrary, but are usually uniformly spaced over an interval of interest. The frequency spacing is also 
arbitrary, but periodograms typically have 100-400 frequencies. Sometimes, periodograms are graphed as a
function of period, instead of frequency. However, we think frequency plots provide better information,
since physical phenomena such as signal modulation have readily apparent frequency structure.


For a given trial frequency, the data are folded at the trial frequency, and binned into r bins (typically
10-20). “Folding” means replacing each data sample time t_i with t_i mod T (T = 1/f). The model value for
each bin is the (weighted) average of the points in the bin. Figure 12.2a shows a significant trial frequency
that reveals a periodicity, and (b) shows an insignificant one. Figure 12.3 shows the folded and fitted result
for the data of Figure 12.1.


[Figure: two panels of folded data, titled “N = 340, foldf = 0.005” and “N = 340, foldf = 0.0367”.]

Figure 12.2 The red line is the best PDM model fit to the folded data. (a) PDM data points and a
good fit, r = 20. (b) A bad fit shows wide dispersion in each bin, and a fairly flat “fit”.


[Figure: folded data, titled “self-generated, N = 400, foldf = 0.01”; sample value vs. time, showing the sample values and the PDM fitted model.]


Figure 12.3 A fit to the folded data clearly shows the periodic component of the signal. 


The PDM algorithm is exactly equivalent to a linear fit of an orthogonal set of r periodic rectangular 
pulses to the data (Figure 12.4). Therefore, the results are amenable to standard analysis of variance 
(ANOVA). Across the PDM literature, there are multiple detection parameters in use, but we recommend 
the standard ANOVA F-statistic (detailed below), which is dimensionless and well-understood. Critical F 
values depend on gaussian residuals, which most data do not have. However, shuffle simulations allow for a
wide variety of detection parameters, and also accommodate non-gaussian residuals. In practice, the
F-statistic is very effective on arbitrary residuals when its critical values are determined from shuffle
simulations, rather than gaussian theory. [Somebody discusses shuffle simulations ??]. For many analyses,
the periodogram peaks are so obvious that quantitative critical values are not needed.

Figure 12.4 Basis functions for the equivalent multiple linear regression fit.


The regression model in PDM comprises the bin averages of the folded/binned data. One early detection
parameter is the pooled variance of all the bins, s² [Stell 1978 eq. 2]; the best fit (most likely period) is that
with the smallest s², meaning measurements in the same bin (similar “phase”) are similar to each other.


ANOVA F-statistic: Instead of s², we recommend the standard ANOVA F-statistic as the detection
parameter. The general theory of ANOVA proves the sum-of-squares identity: the total variation from the
average (SST) equals the variation of the model (SSA), plus the variation in unmodeled residuals (SSE)
[W&MB8 Ch 13]. For the (deprecated) unweighted algorithm (no uncertainties):


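$$\underbrace{SST}_{\text{total}} = \underbrace{SSA}_{\text{model}} + \underbrace{SSE}_{\text{residual}} \qquad \text{where}$$

$$\begin{aligned}
\text{total:} \quad & SST \equiv \sum_{i=1}^{n} (y_i - \bar{y})^2 \\
\text{model:} \quad & SSA \equiv \sum_{b=1}^{r} n_b (\bar{y}_b - \bar{y})^2, \qquad n_b \equiv \text{\# points in bin } b \\
\text{residual:} \quad & SSE \equiv \sum_{b=1}^{r} \sum_{j=1}^{n_b} (y_{bj} - \bar{y}_b)^2, \qquad y_{bj} \equiv \text{the } j\text{th point in the } b\text{th bin}
\end{aligned} \qquad (12.1)$$


For each bin b, the model value is y_mod,b = ȳ_b. SST = SSA + SSE is an exact mathematical identity for all
linear fits with a zero residual sum, Σ ε_i = 0, with the residuals ε_i defined as:

$$\varepsilon_i \equiv y_i - y_{mod,i} = y_i - \bar{y}_{b(i)} \qquad \text{where } b(i) \equiv \text{bin for point } i .$$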


The PDM basis functions automatically remove any DC (constant offset) from the signal, so the residual sum 
is zero; it is not necessary to subtract out the average before computing the periodogram, and doing so has 
no effect. However, some applications subtract any straight-line trend from the data before processing (more 
later). 


Many data sets have individual uncertainties (heteroskedastic data): the uncertainty of each point y; is 
u;. To take account of these unequal uncertainties, we compute the sums-of-squares SST, SSA, and SSE from 
standard formulas for weighted estimates (see Funky Mathematical Physics Concepts). For SST: 


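$$SST = \sum_{i=1}^{n} w_i (y_i - \bar{y})^2 = \sum_{i=1}^{n} w_i \left( y_i^2 - 2 y_i \bar{y} + \bar{y}^2 \right) = \sum_{i=1}^{n} w_i y_i^2 - 2 V_1 \bar{y}^2 + V_1 \bar{y}^2 = \sum_{i=1}^{n} w_i y_i^2 - V_1 \bar{y}^2$$

$$\text{where } w_i \equiv u_i^{-2}, \qquad V_1 \equiv \sum_{i=1}^{n} w_i ,$$

and ȳ is the weighted average, ȳ = (Σ w_i y_i)/V₁. The last expression is efficient for computer evaluation.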
For SSA:

$$SSA = \sum_{b=1}^{r} V_{1b} (\bar{y}_b - \bar{y})^2, \qquad V_{1b} \equiv \sum_{j=1}^{n_b} w_{bj} .$$


As a check on this, we set u_i = 1 for all measurements, and we should recover the unweighted formula (12.1):

$$V_{1b} = 1^{-2} + 1^{-2} + \dots + 1^{-2} = n_b ,$$

as required.


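For SSE, note that in each bin b, y_mod,b = ȳ_b. Then the SSE for bin b is:

$$SSE_b = \sum_{j=1}^{n_b} w_{bj} (y_{bj} - \bar{y}_b)^2 = \sum_{j=1}^{n_b} w_{bj} \left( y_{bj}^2 - 2 y_{bj} \bar{y}_b + \bar{y}_b^2 \right) = \sum_{j=1}^{n_b} w_{bj} y_{bj}^2 - 2 V_{1b} \bar{y}_b^2 + V_{1b} \bar{y}_b^2 = \sum_{j=1}^{n_b} w_{bj} y_{bj}^2 - V_{1b} \bar{y}_b^2 .$$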


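Then the full formulas, in efficient form for computing, are:

$$SST = \sum_{i=1}^{n} w_i y_i^2 - V_1 \bar{y}^2, \qquad SSA = \sum_{b=1}^{r} V_{1b} (\bar{y}_b - \bar{y})^2, \qquad SSE = \sum_{b=1}^{r} \left( \sum_{j=1}^{n_b} w_{bj} y_{bj}^2 - V_{1b} \bar{y}_b^2 \right)$$

$$\text{where } u_i \equiv \text{uncertainties}, \quad w_i \equiv u_i^{-2}, \quad V_{1b} \equiv \sum_{j=1}^{n_b} w_{bj}, \quad \bar{y}_b \equiv y_{mod,b} = \frac{1}{V_{1b}} \sum_{j=1}^{n_b} w_{bj} y_{bj} ,$$

$$SST = SSA + SSE .$$


When all the u_i are equal, this reduces to (12.1). (Numerical experiments on real data, with millions of trial
frequencies, confirm that the implementation satisfies this final equation, the weighted ANOVA
sum-of-squares identity, to within my detection limit of 1 part in 10^17.)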


If the uncertainties are accurate, then SST has degrees of freedom dof =n — 1, SSA has dof = r— 1, and 
SSE has dof = n—r. In pure gaussian noise (no signal), the detection statistic D(f) follows an F,-1, n+ 
distribution: 


SSA/ (r— ) 


D aes = Sas 
(f= SSE/(n— A r—-l,n-r 


where f =—= the trial frequency . 


As with all periodograms, there are multiple independent frequencies, so the false alarm probability (FAP ≡
α) for the entire periodogram is higher than that for a single frequency. As noted in the section “ANOVA
with Uncertainties,” this standard F-test can be used as an approximate test for detection of a signal.
However, the F critical values will be approximate, and therefore so will the p-value.


[Schwa 1998 p832]. However, we dispute his claim [Schwa 1998 p833] that the “accuracy” of random
number generators is not well tested. Using a 64-bit uniform generator, with a recurrence time of ~$10^{19}$, I
have verified some detailed statistics of bin counts in uniformly sized bins. From uniform random numbers,
exact theoretical transforms produce other needed distributions.


[The weighted ANOVA sum-of-squares identity holds only when the weighted sum-of-residuals = 0:

$$\sum_{i=1}^{n} w_i \epsilon_i = 0 .$$

In general regression, this is usually enforced with a fit parameter $b_0$ added as a constant to the model: $y_{mod} =
b_0 + \ldots$. However, the Phase Dispersion Minimization fit functions are flat-topped rectangular pulses, so
they already include any constant offset in the fit. Adding a $b_0$ parameter would be redundant, and therefore
would make the fit unstable. Do not do this.]


PDM Computer Implementation Tips 


It is highly inefficient to actually fold the data and sort it for each trial frequency in the PDM
periodogram. The sorting is at least O(n log n) (and often implemented as O(n^2)), and would be used for each
trial frequency. A small addition is the per-bin statistics calculation. The total computational cost is then, at
best, O((n log_2 n + r) n_f), where n_f is the number of trial frequencies, and often worse. For thousands of
points and frequencies, this could be prohibitive.


Instead, for each trial frequency, make a single pass through the data, and keep statistical running sums
for each of the r bins. There is no sorting. At the end, compute the final statistics for each bin from the
running sums. This is O(n + r), for a final cost of O((n + r) n_f). This reduces computation by at least a factor
of log_2 n (and maybe n, possibly thousands).


PDM On Data With Trends 


Sometimes data includes a straight-line trend (Figure 12.5). Since the PDM fit follows such a trend, it
distorts the detection statistic from that for truly periodic components. Figure 12.6a shows a periodogram
for “untrended” data, i.e. with the straight-line trend first fitted and subtracted. The result is a reasonable
measure of truly periodic components. In contrast, the raw periodogram (Figure 12.6b), made from the original
data with the trend, is very different at most frequencies. Therefore, if you are looking for periodic signals,
and have a significant trend in your data, we recommend you “untrend” your data before making the PDM
periodogram.


For reference, here are the straight-line fit equations for uncertainty-weighted data [Strutz 7.16 p177].
We start with measurement triples $(t_i, y_i, u_i)$, where $t_i$ are the independent variables, $y_i$ are the measured
quantities, and $u_i$ are the uncertainties for each measurement. Then, writing $x_i$ for the independent variable:

$$y_{mod}(x) = b_0 + b_1 x, \qquad w_i \equiv u_i^{-2},$$

$$V_1 \equiv \sum_{i=1}^{n} w_i, \quad S_x \equiv \sum_{i=1}^{n} w_i x_i, \quad S_y \equiv \sum_{i=1}^{n} w_i y_i, \quad S_{xx} \equiv \sum_{i=1}^{n} w_i x_i^2, \quad S_{xy} \equiv \sum_{i=1}^{n} w_i x_i y_i,$$

$$b_0 = \frac{S_{xx} S_y - S_x S_{xy}}{V_1 S_{xx} - S_x^2}, \qquad b_1 = \frac{V_1 S_{xy} - S_x S_y}{V_1 S_{xx} - S_x^2} .$$


[Figure: sample value vs. time (+2.453e6), 0 to 6000, showing the samples and the PDM fitted model.]

Figure 12.5 Example PDM fit (red, at the peak frequency) with folded data, when the data includes
a straight-line trend. r = 20 bins. (The axes are not important.)
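The weighted straight-line fit equations translate directly into code. The sketch below is a minimal illustration. The function name fit_line_weighted() and its error convention are assumptions made for this example, not anything taken from [Strutz].

```c
#include <stddef.h>

/* Uncertainty-weighted straight-line fit, y_mod(x) = b0 + b1*x, using
   weights w_i = u_i^-2.  Returns 0 on success, or -1 if the fit is
   degenerate (fewer than 2 points, or a zero determinant, e.g. all x equal). */
int fit_line_weighted(const double x[], const double y[], const double u[],
                      size_t n, double *b0, double *b1)
{
    double V1 = 0., Sx = 0., Sy = 0., Sxx = 0., Sxy = 0., det;
    size_t i;

    if (n < 2) return -1;
    for (i = 0; i < n; ++i)
    {   double w = 1. / (u[i] * u[i]);
        V1  += w;
        Sx  += w * x[i];
        Sy  += w * y[i];
        Sxx += w * x[i] * x[i];
        Sxy += w * x[i] * y[i];
    }
    det = V1 * Sxx - Sx * Sx;           /* common denominator */
    if (det == 0.) return -1;
    *b0 = (Sxx * Sy - Sx * Sxy) / det;
    *b1 = (V1 * Sxy - Sx * Sy) / det;
    return 0;
}
```

To “untrend” the data, subtract b0 + b1*x[i] from each y[i] before computing the periodogram.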


[Figure: two panels, (a) and (b), each plotting the periodogram against frequency (cyc/time) from 0.000 to
0.010. Data: libresid.dat, N = 2737, r = 20, nf = 400.]

Figure 12.6 (a) PDM periodogram with the trend removed. (b) Periodogram with the trend intact
is vastly different at most frequencies.


Subharmonic Response 


Any signal periodic in T = 1/f is also periodic in 2T, 3T, .... Therefore, PDM produces a peak at each
subharmonic of the periodicity. At least one subharmonic is visible in Figure 12.1d, and possibly 3. Often
5 or more subharmonics are visible. The response weakens slowly with each subharmonic order (increasing
period) because data points from a wider range of “phase” (wider fraction of a period) are included in each
bin as the subharmonic order increases. In effect, the bins become wider: they occupy a larger fraction of
the true (fundamental) period.


Harmonic response 


Harmonics of the fundamental frequency are sometimes visible as peaks in the periodogram, but they
decay quickly with harmonic order. Harmonics are usually less visible than subharmonics. Figure 12.7a
shows why: a fit to a harmonic must compromise between the two phases that are mapped to the same folded
interval. The best fit is halfway between the two groups of points, and so not very good. Figure 12.7b is the
2nd harmonic fit to the data of Figure 12.1. Its amplitude is roughly half that of the fundamental in Figure
12.4.


[Figure: two panels of folded data (sample value) with the PDM fitted model. (a) n_01_sim.dat, N = 400,
foldf = 0.02. (b) n_1_sim.dat, N = 400, foldf = 0.02.]

Figure 12.7 (a) Fit to the simulated low-noise transit data. (b) Fit to the simulated high-noise transit
data.


Spectral window function 


There is often confusion about whether the spectral window function (SWF) affects PDM periodograms. 
It does not. The SWF is computed explicitly for sinusoidal fits, such as the generalized sinusoidal 
periodogram [ref??], or the now-deprecated Lomb-Scargle periodogram. 


NB: The “spectral window function” is not a windowing function for smoothing data,
or for smoothing a spectrum. Those are unrelated to the “spectral window function,”
which is computed from the sample times of time-series data.


13. Numerical Analysis


Round-Off Error, And How to Reduce It

Floating point numbers are stored in a form of scientific notation, with a mantissa and exponent. E.g.,
1.23 × 10^45 has mantissa m = 1.23 and exponent e = 45.


Computer floating point stores only a finite number of digits. ‘float’ (aka single-precision) typically stores 
at least 6 digits; ‘double’ typically stores at least 15 digits. We’ll work out some examples in 6-digit decimal 
scientific notation; actual floating point numbers are stored in a binary form, but the concepts remain the 
same. (See “IEEE Floating Point” in this document.) 


Precision loss due to summation: Adding floating point numbers with different exponents results in 
round-off error: 


      1.234 56 × 10^2     →       1.234 56    × 10^2
    + 6.111 11 × 10^0     →     + 0.061 111 1 × 10^2
                                = 1.295 67    × 10^2     where 0.000 001 1 × 10^2 of the result is lost,


because the computer can only store 6 digits. (Similar round-off error occurs if the exponent of the result is 
bigger than both of the addend exponents.) When adding many numbers of similar magnitude (as is common 
in statistical calculations), the round-off error can be quite significant: 


    float sum = 1.23456789;             // Demonstrate precision loss in sums
    printf("%.9f\n", sum);              // show # significant digits
    for(i = 1; i < 10000; i++)
        sum += 1.23456789;
    printf("Sum of 10,000 = %.9f\n", sum);


    1.234567881                 8 significant digits
    Sum of 10,000 = 12343.28    only 4 significant digits


We lost about 1 digit of accuracy for each power of 10 in n, the number of terms summed. I.e.:
    digit-loss ≈ log_10 n .


When summing numbers of different magnitudes, you get a better answer by adding the small numbers first, 
and the larger ones later. This minimizes the round-off error on each addition, and shows that: 


Finite-precision addition is not associative.


E.g., consider summing 1/n for 1,000,000 integers. We do it in both single- and double-precision, so 
you can see the error: 


    float sum = 0.;
    double dsum = 0.;

    // sum the inverses of the first 1 million integers, in order
    for(i = 1; i <= 1000000; i++)
        sum += 1./i, dsum += 1./i;
    printf("sum: %f\ndsum: %f. Relative error = %.6f\n",
        sum, dsum, (dsum-sum)/dsum);


    sum: 14.357358
    dsum: 14.392727. Relative error = 0.002457


This was summed in the worst possible order: largest to smallest, and (in single-precision) we lose about 5 
digits of accuracy, leaving only 3 digits. Now sum in reverse (smallest to largest): 


    float sumb = 0.;
    double dsumb = 0.;

    for(i = 1000000; i >= 1; i--)
        sumb += 1./i, dsumb += 1./i;
    printf(" sumb: %f\ndsumb: %f. Relative error = %.6f\n",
        sumb, dsumb, (dsumb-sumb)/dsumb);


     sumb: 14.392652
    dsumb: 14.392727. Relative error = 0.000005


The single-precision sum is now good to 5 digits, losing only 1 or 2. 


[In my research, I needed to fit a polynomial to 6000 data points, which involves many sums of 6000 terms, 
and then solving linear equations. I needed 13 digits of accuracy, which easily fits in double-precision (‘double’, 
15-17 decimal digits). However, the precision loss due to summing was over 3 digits, and my results failed. Simply 
changing the sums to ‘long double’, then converting the sums back to ‘double’, and doing all other calculations in 
‘double’ solved the problem. The dominant loss was in the sums, not in solving the equations. ] 


Summing from smallest to largest is very important for evaluating polynomials, which are widely used
for transcendental functions. Suppose we have a 5th order polynomial, f(t):

$$f(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3 + a_4 t^4 + a_5 t^5 ,$$

which might suggest a computer implementation as:

    f = a0 + a1*t + a2*t*t + a3*t*t*t + a4*t*t*t*t + a5*t*t*t*t*t;


Typically, the terms get progressively smaller with higher order. Then the above sequence is in the worst
order: biggest to smallest. (It also takes 15 multiplies.) It is more accurate (and faster) to evaluate the
polynomial as:

    f = ((((a5*t + a4)*t + a3)*t + a2)*t + a1)*t + a0;


This form adds small terms of comparable size first, progressing to larger ones, and requires only 5 multiplies. 
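The nested form generalizes to any order as a short loop (often called Horner's rule). The sketch below is illustrative only. The function name poly_eval() and the coefficient-array layout are assumptions for this example.

```c
/* Evaluate a[0] + a[1]*t + ... + a[order]*t^order in nested form:
   one multiply and one add per coefficient, starting from the highest order. */
double poly_eval(const double a[], int order, double t)
{
    double f = a[order];
    int k;

    for (k = order - 1; k >= 0; --k)
        f = f * t + a[k];
    return f;
}
```

For the 5th order case, poly_eval(a, 5, t) performs the same 5 multiplies and 5 adds as the hand-written nested expression.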


How To Extend Precision In Sums Without Using Higher Precision Variables


(Handy for statistical calculations): You can avoid round-off error in sums without using higher
precision variables with a simple trick. For example, let’s sum an array of n numbers:

    sum = 0.;
    for(i = 0; i < n; i++) sum += a[i];


This suffers from precision loss, as described above. The trick is to actually measure the round-off error of 
each addition, and save that error for the next iteration. This is called a compensated add: 


    sum = 0.;
    error = 0.;         // the carry-in from the last add
    for(i = 0; i < n; i++)
    {   newsum = sum + (a[i] + error);  // include the lost part of prev add
        diff = newsum - sum;            // what was really added
        error = (a[i] + error) - diff;  // the round-off error
        sum = newsum;
    }


The ‘error’ variable is the round-off error, and is always small compared to the ‘sum’. Keeping track of it 
effectively doubles the number of accurate digits in the sum. After each addition, ‘error’ tells you how far 
off your sum is. For most practical purposes, this eliminates any precision loss due to sums. Let’s try 
summing the inverses of integers again, in the “bad” order, but with this trick: 


    float newsum, diff, sum = 0., error = 0.;

    for(i = 1; i <= 1000000; i++)
    {   newsum = sum + (1./i + error);
        diff = newsum - sum;            // what was really added
        error = (1./i + error) - diff;  // the round-off error
        sum = newsum;
    }
    printf(" sum: %f\ndsumb: %f. Relative error = %.6f, error = %g\n",
        sum, dsumb, (dsumb-sum)/dsumb, error);


      sum: 14.392727
    dsumb: 14.392727. Relative error = -0.000000, error = -1.75335e-07


As claimed, the sum is essentially perfect. 


Numerical Integration 


The above method of sums is extremely valuable in numerical integration. Typically, for accurate 
numerical integration, one must carefully choose an integration step size: the increment by which you change 
the variable of integration. E.g., in time-step integration, it is the time step-size. If you make the step size 
too big, accuracy suffers because the “rectangles” (or other approximations) under the curve don’t follow the 
curve well. If you make the step size too small, accuracy suffers because you’re adding tiny increments to 
large numbers, and the round-off error is large. You must “thread the needle” of step-size, getting it “just 
right” for best accuracy. This fact is independent of the integration interpolation method: trapezoidal, 
quadratic, Runge-Kutta ... . 


By virtually eliminating round-off error in the sums (using the method above), you eliminate the lower- 
bound on step size. You can then choose a small step-size, and be confident your answer is right. It might 
take more computer time, but integrating 5 times slower and getting the right answer is vastly better than 
integrating 5 times faster and getting the wrong answer. 
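As a hedged illustration of this point, the sketch below applies the compensated add to a midpoint-rule integration carried out entirely in single precision. The function names integrate_compensated() and square(), and the choice of the midpoint rule, are assumptions for this example. Any of the integration methods mentioned above could accumulate its sum the same way.

```c
/* Midpoint-rule integral of f over [a, b] in n equal steps, accumulated in
   single precision with a compensated add, so that a very small step size
   does not pile up round-off error in the running sum. */
float integrate_compensated(float (*f)(float), float a, float b, long n)
{
    float h = (b - a) / n;              /* step size */
    float sum = 0.f, error = 0.f, newsum, diff, term;
    long i;

    for (i = 0; i < n; ++i)
    {   term = f(a + (i + 0.5f) * h) * h;   /* area of one rectangle */
        newsum = sum + (term + error);      /* include lost part of prev add */
        diff = newsum - sum;                /* what was really added */
        error = (term + error) - diff;      /* the round-off error */
        sum = newsum;
    }
    return sum;
}

/* Example integrand: f(x) = x^2, whose integral over [0, 1] is 1/3 */
float square(float x) { return x * x; }
```

For example, integrate_compensated(square, 0.f, 1.f, 1000000L) adds a million increments of order 10^-7 to a sum of order 1. The compensation is lost if the compiler is allowed to re-associate floating-point arithmetic (for example, with a "fast math" option).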


Sequences of Real Numbers 


Suppose we want to generate the sequence 2.01, 2.02, ... 2.99, 3.00. A simple (but flawed) approach is
this:

    float s;
    for(s = 2.01; s <= 3.; s += 0.01)


The problem with this is round-off error: 0.01 is inexact in binary (has round-off error). This error
accumulates 100 times in the above loop, making the last value 100^(1/2) ≈ 10 times more wrong than the first.
In fact, the loop might run 101 times instead of 100. The fix is to use integers where possible, because they
are exact:

    float s;
    int i;
    for(i = 201; i <= 300; ++i)
    {   s = i/100.;

    }


When the increment is itself a variable, note that multiplying a real by an integer incurs only a single 
round-off error: 


    real s, base, incr;
    int i;
    for(i = 1; i <= max; ++i) s = base + i*incr;


Hence, every number in the sequence has only two round-off errors, from the multiply and the add. 


Root Finding 


In general, a root of a function f(x) is a value of x for which f(x) = 0. It is often not possible to find the 
roots analytically, and it must be done numerically. [TBS: binary search] 


Simple Iteration Equation 


Some forms of f( ) make root finding easy and fast; if you can rewrite the equation in this form:

$$f(x) = 0 \quad \Rightarrow \quad x = g(x)$$


then you may be able to iterate, using each value of g() as the new estimate of the root, r. 


This is the simplest method of root finding, and generally the slowest to converge. 


It may be suitable if you have only a few thousand solutions to compute, but may be too slow for millions of 
calculations. 


You start with a guess that is close to the root, call it $r_0$. Then

$$r_1 = g(r_0), \quad r_2 = g(r_1), \quad \ldots, \quad r_{n+1} = g(r_n) .$$

If g() has the right property (specifically, |g′(x)| < 1 near the root) this sequence will converge to the solution.
We describe this necessary property through some examples. Suppose we wish to solve $\sqrt{x}/2 - x = 0$
numerically (exact is x = 0.25). First, we re-arrange it to isolate x on the left side: $x = \sqrt{x}/2$ (Figure 13.1a).


[Figure: (a) $y = \sqrt{x}/2$ and y = x; (b) $y = 4x^2$ and y = x; both plotted for x from 0 to 1.]

Figure 13.1 Two iteration equations for the same problem. (a) Sequence converges. (b) Sequence
fails.

From the graph, we might guess $r_0 \approx 0.2$. Then we would find,

$$r_1 = \sqrt{0.2}/2 = 0.2236, \quad r_2 = \sqrt{r_1}/2 = 0.2364, \quad r_3 = 0.2431, \quad r_4 = 0.2465, \quad r_5 = 0.2483, \quad r_6 = 0.2491, \quad r_7 = 0.2496 .$$

We see that the iterations approach the exact answer of 0.25.


But we could have re-arranged the equation differently: $2x = \sqrt{x}$, $x = 4x^2$ (Figure 13.1b). Starting
with the same guess x = 0.2, we get this sequence:

$$r_1 = 4(0.2)^2 = 0.16, \quad r_2 = 4 r_1^2 = 0.1024, \quad r_3 = 0.0419, \quad r_4 = 0.0070 .$$


It is not converging on the nearby root; the sequence diverges away from it. So what’s the difference? Look
at a graph of what’s happening, magnified around the equality:


[Figure: magnified view near x = 0.25, with iterates starting from x = 0.2. (a) $y = \sqrt{x}/2$ and y = x.
(b) $y = 4x^2$ and y = x.]

Figure 13.2 (a) Sequence converges. (b) Sequence fails.


When the curve is flatter than y = x (above left), then trial roots that are too small get bigger, and trial
roots that are too big get smaller. So iteration approaches the root. When the curve is steeper than y = x
(above right), trial roots that are too small get even smaller, too big get even bigger; the opposite of what we
want. So for positive slope curves, the condition for convergence is

$$\frac{\Delta y}{\Delta r} \equiv \frac{y(s) - y(r)}{s - r} < 1, \qquad \text{in the region } r - |r_0 - r| < s < r + |r_0 - r| ,$$

where r is the exact root; $r_0$ is the first guess.


Consider another case, where the curve has negative slope. Suppose we wish to solve $\cos^{-1} x - x = 0$
(x in radians). We re-write it as $x = \cos^{-1} x$. On the other hand, we could take the cosine of both sides and
get an equivalent equation: $x = \cos x$. Which will converge? Again look at the graphs:


[Figure: three panels plotting cos x and $\cos^{-1} x$ against y = x; the curves cross y = x at x ≈ 0.739.]

Figure 13.3 (Left) cos and $\cos^{-1}$ are superficially similar. (Middle) cos converges everywhere.
(Right) $\cos^{-1}$ fails everywhere.


So long as the magnitude of the slope < 1 in the neighborhood of the solution, the iterations converge. When
the magnitude of the slope > 1, they diverge. We can now generalize to all curves of any slope:


The general condition for convergence is

$$\left| \frac{\Delta y}{\Delta r} \right| \equiv \left| \frac{y(s) - y(r)}{s - r} \right| < 1, \qquad \text{in the region } r - |r_0 - r| < s < r + |r_0 - r| .$$


The flatter the curve, the faster the convergence. 


Given this, we could have easily predicted that the converging form of our iteration equation is x = cos x,
because the magnitude of the slope of cos x is always < 1, and that of $\cos^{-1} x$ is always > 1. Note, however,
that if the magnitude of the derivative (slope) is > 1/2, then the binary search will be faster than iteration.

Newton-Raphson Iteration


The above method of variable iteration is kind of “blind,” in that it doesn’t use any property of the given
functions to advantage. Newton-Raphson iteration is a method of finding roots that uses the derivative of
the given function to provide more reliable and faster convergence. Newton-Raphson uses the original form
of the equation: $f(x) = \sqrt{x}/2 - x = 0$. The idea is to use the derivative of the function to approximate its
slope to the root (Figure 13.4a). We start with the same guess, $r_0 = 0.2$.


[Figure: (a) $\sqrt{x}/2 - x$ and (b) $4x^2 - x$, each with its tangent line at the current estimate, the
rise Δf and the run Δx, for x from 0.1 to 0.25.]

Figure 13.4 Illustration of Newton-Raphson for both forms of the root-finding equation.


$$\frac{\Delta f}{\Delta r} \approx f'(r_0) \quad \Rightarrow \quad \Delta r = -\frac{f(r_0)}{f'(r_0)} \qquad (\text{Note } f'(r_0) < 0)$$

$$f'(x) = \frac{1}{4} x^{-1/2} - 1 \quad \Rightarrow \quad \Delta r = -\frac{r_i^{1/2}/2 - r_i}{\frac{1}{4} r_i^{-1/2} - 1} = -\frac{2 r_i - 4 r_i^{3/2}}{1 - 4 r_i^{1/2}} .$$


Here’s a sample computer program fragment, and its output:

    // Newton-Raphson iteration
    r = 0.2;
    for(i = 1; i < 10; i++)
    {   r -= (2.*r - 4.*r*sqrt(r)) / (1. - 4.*sqrt(r));
        printf("r%d %.16f\n", i, r);
    }


    r1 0.253 532 216 545 439 2
    r2 0.250 012 217 175 258 8
    r3 0.250 000 000 149 248 4
    r4 0.250 000 000 000 000 0


In 4 iterations, we get essentially the exact answer, to double precision accuracy of 16 digits. This is much
faster than the variable isolation method above. In fact, it illustrates a property of some iterative numerical
methods called quadratic convergence: the number of accurate digits roughly doubles with each iteration.


You can see this clearly above, where $r_1$ has 2 accurate digits, $r_2$ has 4, $r_3$ has 9, and $r_4$ has at least 16 (maybe
more). Derivation of quadratic convergence??


Another advantage of Newton-Raphson over simple iteration is that Newton-Raphson does not have the
restriction on the slope of any function, as does simple iteration. We can use it just as well on the reverse
formula (Figure 13.4b):

$$f(x) = 4x^2 - x, \qquad f'(x) = 8x - 1, \qquad \Delta r = -\frac{f(r_i)}{f'(r_i)} = -\frac{4 r_i^2 - r_i}{8 r_i - 1} ,$$

with these computer results:

    r1 0.266 666 666 666 666 7
    r2 0.250 980 392 156 862 7
    r3 0.250 003 814 755 474 2
    r4 0.250 000 000 058 207 7
    r5 0.250 000 000 000 000 0

This converges essentially just as fast as the first form, and clearly shows quadratic convergence.


If you are an old geek like me, you may remember the iterative method of finding square roots on an old
4-function calculator: to find $\sqrt{a}$: divide a by r, then average the result with r. Repeat as needed:

$$r_{n+1} = \frac{a/r_n + r_n}{2} .$$

You may now recognize that as Newton-Raphson iteration:

$$f(r) \equiv r^2 - a = 0, \qquad f'(r) = 2r,$$

$$r_{n+1} = r_n + \Delta r = r_n - \frac{f(r_n)}{f'(r_n)} = r_n - \frac{r_n^2 - a}{2 r_n} = r_n - \frac{r_n}{2} + \frac{a}{2 r_n} = \frac{a/r_n + r_n}{2} .$$

If you are truly a geek, you tried the averaging method for cube roots: $r_{n+1} = \frac{a/r_n^2 + r_n}{2}$. While you
found that it converged, it was very slow; cube-root(16) with $r_0 = 2$ gives only 2 digits after 10 iterations.
Now you know that the proper Newton-Raphson iteration for cube roots is:

$$f(r) \equiv r^3 - a = 0, \qquad f'(r) = 3r^2, \qquad r_{n+1} = r_n - \frac{r_n^3 - a}{3 r_n^2} = r_n - \frac{r_n}{3} + \frac{a}{3 r_n^2} = \frac{1}{3} \left( 2 r_n + \frac{a}{r_n^2} \right)$$


which gives a full 17 digits in 5 iterations for $r_0 = 2$, and shows (of course) quadratic convergence:

    r1  2.6666666666666665
    r2  2.527777777777...
    r3  2.5198669868999541
    r4  2.5198421000355395
    r5  2.5198420997897464
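The cube-root iteration can be sketched as a small function. The name cube_root_newton() and its stopping rule are assumptions for this illustration. The update line is the Newton-Raphson formula derived above.

```c
/* Cube root of a (a > 0) by Newton-Raphson on f(r) = r^3 - a:
       r_{n+1} = (2*r_n + a/r_n^2) / 3
   r0 is the starting guess (r0 > 0).  Stops after maxiter steps, or sooner
   if an update no longer changes r at double precision. */
double cube_root_newton(double a, double r0, int maxiter)
{
    double r = r0;
    int n;

    for (n = 0; n < maxiter; ++n)
    {   double rnew = (2. * r + a / (r * r)) / 3.;
        if (rnew == r) break;           /* converged to machine precision */
        r = rnew;
    }
    return r;
}
```

For example, cube_root_newton(16., 2., 5) performs the five iterations listed above.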


It is possible for Newton-Raphson to cycle endlessly, if the initial estimate of the root is too far off, and 
the function has an inflection point between two successive iterations (Figure 13.5): 


[Figure: f(x) with two tangent lines, each sending the iteration back to the other tangent point.]

Figure 13.5 Failure of Newton-Raphson iteration.


It is fairly easy to detect this failure in code, and pull in the root estimate (say, with a midpoint method) 
before iterating again. 
Pseudo-Random Numbers


We use the term “random number” to mean “pseudo-random number,” for brevity. Uniformly
distributed random numbers are equally likely to be anywhere in a range, typically (0, 1).


Uniformly distributed random numbers are the starting point
for many other statistical applications.


Computers can easily generate uniformly distributed random numbers. The best generators today are
based on linear feedback shift registers (LFSR) [Numerical Recipes, 3rd ed.]. The old linear-congruential
generator is:

    // Uniform random value, 0 < v < 1, i.e. on (0,1) exclusive.
    // Numerical Recipes in C, 2nd ed., p284

    static uint32 seed=1;       // starting point

    vflt rand_uniform(void)
    {
        do seed = 1664525L*seed + 1013904223L;  // period 2^32-1
        while(seed == 0);
        rand_calls++;           // count calls for repetition check

        return seed / 4294967296.;
    } // rand_uniform()


Many algorithms that use random numbers fail on a “random” value of 0 or 1, 
so this generator never returns them. 


After a long simulation with a large number of calls, it’s a good idea to check ‘rand_calls’ to be sure it’s 
< ~400,000,000 = 10% period. This confirms that the generated numbers are essentially random, and not 
predictable. 


Arbitrary distribution random numbers: To generate any distribution from a uniform random
number:

$$R = cdf_R^{-1}(U) \qquad \text{where } R \text{ is the random variable of the desired distribution,}$$
$$cdf_R^{-1} \equiv \text{inverse of the desired cumulative distribution function of } R,$$
$$U \text{ is a uniform random number on } (0,1) .$$

Figure 13.6 illustrates the process graphically. We can derive it mathematically as follows: recall that the
cumulative distribution function gives the probability of a random variable being less than or equal to its
argument:

$$cdf_X(a) \equiv \Pr(X \le a) = \int_{-\infty}^{a} dx \; pdf_X(x) \qquad \text{where } X \text{ is a random variable.}$$


[Figure: three panels showing pdf(x), cdf(x), and $cdf^{-1}(x)$ for an example distribution on (−0.5, 0.5).]

Figure 13.6 Steps to generating the probability distribution function (pdf) on the left.


Also recall that the pdf of a function of a random variable, say F = f(u), is (see Probability and Statistics
elsewhere in this document):

$$pdf_F(x) = \frac{pdf_U(u)}{\left| f'(u) \right|}, \qquad \text{where } x = f(u), \text{ and } f'(u) \equiv \text{derivative of } f(u) .$$

Let $Q \equiv cdf_R^{-1}(U)$. Using $pdf_U(u) = 1$ on [0, 1], and $\frac{d}{du} g^{-1}(u) = \left[ \frac{d}{dx} g(x) \right]^{-1}$:

$$pdf_Q(r) = \frac{1}{\frac{d}{du} cdf_R^{-1}(u)} = \frac{d}{dr} cdf_R(r) = pdf_R(r), \text{ as desired.}$$


Generating Gaussian Random Numbers


The inverse CDF method is a problem for gaussian random numbers (and many others), because there
is no closed-form expression for the CDF of a gaussian (or for the $CDF^{-1}$):

$$CDF(a) = \int_{-\infty}^{a} dx \, \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \qquad \text{(gaussian)} .$$

But [Knu] describes a clever way based on polar coordinates to use two uniform random numbers to generate
a gaussian. He gives the details, but the final result is this:

$$\text{gaussian} = \left( \sqrt{-2 \ln u} \right) \cos\theta \qquad \text{where } \theta \text{ is uniform on } (0, 2\pi), \quad u \text{ is uniform on } (0,1) .$$

    /* Gaussian random value, 0 mean, unit variance. From Knuth, "The Art of
       Computer Programming, Vol. 2: Seminumerical Algorithms," 2nd Ed., p. 117.
       It is exactly normal if rand_uniform() is uniform. */
    PUBLIC double rand_gauss(void)
    {   double theta = (2.*M_PI) * rand_uniform();
        return sqrt( -2. * log(rand_uniform()) ) * cos(theta);
    } // rand_gauss()


Note that u must exclude both 0 and 1. 


Generating Poisson Random Numbers 


Poisson random numbers are integers; we say the Poisson distribution is discrete: 


[Figure: three panels with vertical or horizontal scales from 0 to 1.00 and event counts 0 to 5, showing the
steps of the discrete inverse-cdf construction.]

Figure 13.7 Example of generating the (discrete) Poisson distribution.


We can still use the inverse-cdf method to generate them, but in an iterative way. The code starts with a
helper function, poisson( ), that computes the probability of exactly n events in a Poisson distribution with an
average of avg events:

    //----------------------------------------------------------------------------
    double poisson(         // Pr(exactly n events in interval)
        double avg,         // average events in interval
        int n)              // n to compute Pr() of
    {   double factorial;
        int i;

        if(n <= 20) factorial = fact[n];    // fact[] is a table of factorials
        else
        {   factorial = fact[20];
            for(i = 21; i <= n; ++i) factorial *= i;
        }
        return exp(-avg) * pow(avg, n) / factorial;
    } // poisson()


    /*----------------------------------------------------------------------------
    Generates a Poisson random value (an integer), which must be <= 200.
    Prefix 'irand_...' emphasizes the discreteness of the Poisson distribution.
    ----------------------------------------------------------------------------*/
    int irand_poisson(      // Poisson random integer <= 200
        double avg)         // avg # "events"
    {
        int i;
        double cpr;         // uniform probability

        // Use inverse-cdf(uniform) for Poisson distribution, where
        // inverse-cdf() consists of flat, discontinuous steps
        cpr = rand_uniform();
        for(i = 0; i <= 200; ++i)   // safety limit of 200
        {   cpr -= poisson(avg, i);
            if(cpr <= 0) break;
        }
        return i;           // 201 indicates an error
    } // irand_poisson()


Other example random number generators: TBS. 


Generating Useful, But More Challenging, Random Numbers 


Sometimes you need to generate more complex distributions, such as a combination of a gaussian with 
a uniform background of noise, a “raised gaussian” (Figure 13.8). 


[Figure: PDF(x) vs. x, showing a gaussian PDF sitting on top of a uniform PDF.]

Figure 13.8 Construction of a raised gaussian random variable from a uniform and a gaussian.


Since this distribution has a uniform “component,” it is only meaningful if it’s limited to some finite “width.” 
To generate distributions like this, you can compose two different distributions, and use the principle: 


The PDF of a random choice of two random variables is the weighted sum of the individual PDFs. 


For example, the PDF for an RV (random variable) which is taken from X 20% of the time, and Y the 
remaining 80% of the time is: 


$$pdf_Z(z) = 0.2 \, pdf_X(z) + 0.8 \, pdf_Y(z) .$$

In this example, the two component distributions are uniform and gaussian. Suppose the uniform part of the
pdf has amplitude 0.1 over the interval (0, 2). Then it accounts for 0.2 of all the random values. The
remainder are gaussian, which we take to be mean of 1.0, and σ = 0.5. Then the random value can be
generated from three more-fundamental random values:

    // Raised Gaussian random value: gaussian part: mean=1, sigma=0.5
    // Uniform part (20% chance): interval (0, 2)
    if(rand_uniform() <= 0.2)
        random_variable = rand_uniform()*2.0;
    else
        random_variable = rand_gauss()*0.5 + 1.0;   // mean = 1, sigma = 0.5

This isn’t quite right, because the gaussian will have values < 0 and > 2, which will give weird “tails” to the
generated PDF. More complete code would detect those, and regenerate until a valid value results.


Exact Polynomial Fits


It’s sometimes handy to make an exact fit of a quadratic, cubic, or quartic polynomial to 3, 4, or 5 data 
points, respectively. 


[Figure: three sketches of exact fits. 3 points, 2nd order, at x = −1, 0, 1. 4 points, 3rd order, at
x = −1, 0, 1, 2. 5 points, 4th order, at x = −2, −1, 0, 1, 2.]

The quadratic case illustrates the principle simply. We seek a quadratic function:

$$y(x) = a_2 x^2 + a_1 x + a_0$$

which exactly fits 3 equally spaced points, at x = −1, x = 0, and x = 1, with value $y_{-1}$, $y_0$, and $y_1$, respectively
(shown above). So long as your actual data are equally spaced, you can simply scale and offset to the x
values −1, 0, and 1. We can directly solve for the coefficients $a_2$, $a_1$, and $a_0$:

$$a_2 (-1)^2 + a_1 (-1) + a_0 = y_{-1} \quad \Rightarrow \quad a_2 - a_1 + a_0 = y_{-1}$$
$$a_2 (0)^2 + a_1 (0) + a_0 = y_0 \quad \Rightarrow \quad a_0 = y_0$$
$$a_2 (1)^2 + a_1 (1) + a_0 = y_1 \quad \Rightarrow \quad a_2 + a_1 + a_0 = y_1$$

$$\Rightarrow \quad a_2 = (y_1 + y_{-1})/2 - y_0, \qquad a_1 = (y_1 - y_{-1})/2, \qquad a_0 = y_0 .$$
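The quadratic result is short enough to code directly. In this sketch, the function name fit2nd() and the output-array convention are assumptions chosen for the illustration.

```c
/* Exact quadratic through (-1, ym1), (0, y0), (1, y1):
       y(x) = a[2]*x^2 + a[1]*x + a[0] */
void fit2nd(double ym1, double y0, double y1, double a[3])
{
    a[0] = y0;
    a[1] = (y1 - ym1) / 2.;
    a[2] = (y1 + ym1) / 2. - y0;
}
```

For example, the points (−1, 0), (0, 1), (1, 6) lie on $y = 2x^2 + 3x + 1$, and fit2nd(0., 1., 6., a) returns a[] = {1, 3, 2}.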


Similar formulas for the 3rd and 4th order fits yield this code:

    //----------------------------------------------------------------------------
    // fit3rd() computes 3rd order fit coefficients. 4 mult/div, 8 adds
    PUBLIC void fit3rd(
        double ym1, double y0, double y1, double y2)
    {
        a0 = y0;
        a2 = (ym1 + y1)/2. - y0;
        a3 = (2.*ym1 + y2 - 3.*y0)/6. - a2;
        a1 = y1 - y0 - a2 - a3;
    } // fit3rd()

    //----------------------------------------------------------------------------
    // fit4th() computes 4th order fit coefficients. 6 mult/div, 13 add
    PUBLIC void fit4th(
        double ym2, double ym1, double y0, double y1, double y2)
    {
        b0 = y0;
        b4 = (y2 + ym2 - 4*(ym1 + y1) + 6*y0)/24.;
        b2 = (ym1 + y1)/2. - y0 - b4;
        b3 = (y2 - ym2 - 2.*(y1 - ym1))/12.;
        b1 = (y1 - ym1)/2. - b3;
    } // fit4th()


TBS: Alternative 3rd order (4 point) symmetric fit, with x = {-3, -1, 1, 3}.


14 Computer Math Internals


Digital Integer Arithmetic: Two’s Complement 


Two’s complement is a way of representing negative numbers in binary. It is universally used for 
integers, and rarely used for floating point. This section assumes the reader is familiar with positive binary 
numbers and simple binary arithmetic. 


                             2^3   2^2   2^1   2^0

    Most Significant Bit      0     1     1     0     Least Significant Bit
           (MSB)                                             (LSB)


Two’s complement uses the most significant bit (MSB) of an integer as a sign bit: zero means the number
is ≥ 0; 1 means the number is negative. Two’s complement represents non-negative numbers as ordinary
binary, with the sign bit = 0. Negative numbers have the sign bit = 1, but are stored in a special way: for a
b-bit word, a negative number n (n < 0) is stored as if it were unsigned with a value of $2^b - |n| = 2^b + n$. This
is shown below, using a 4-bit “word” as a simple example:


bits unsigned signed 


0000 0 0 

J, oat 1 1 
0010 2 2 
sign bit 0011 is S 
0100 4 4 

0101 5 5 

0110 6 6 

QL AL 7 7 

1000 8 -8 

1001 cS) d 

1010 10 -6 

1011 aval =5 

1100 ile -4 

1101 13 -3 

L110 14 a2 

Ut 13) = 


With two’s complement, a 4-bit word can store integers from —8 to +7. E.g., —1 is stored as 16-1 = 15. 
The two’s-complement rule is usually defined as follows (which completely obscures the purpose): 


    For n < 0, let a = |n|                                  Example: n = −4, a = 4
    Start with the bit pattern for a:                       0100
    Complement it (change all 0s to 1s and 1s to 0s):       1011
    Add 1:                                                  1100
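

The recipe above and the “2^b + n” definition give the same bit pattern. Here is a minimal sketch in C that
computes the pattern both ways; the helper names twos_pattern() and complement_add_one() are ours, chosen
only for illustration, and the code assumes word sizes from 2 to 31 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the b-bit two's complement pattern of a negative n,
   computed directly as the unsigned value 2^b + n. */
uint32_t twos_pattern(int n, unsigned b)
{
    assert(b >= 2 && b <= 31);
    assert(n < 0 && n >= -((int64_t)1 << (b - 1)));
    return (uint32_t)(((int64_t)1 << b) + n);
}

/* Hypothetical helper: the textbook rule.  Complement the bits of a = |n|,
   add 1, and keep only the low b bits. */
uint32_t complement_add_one(int n, unsigned b)
{
    assert(b >= 2 && b <= 31);
    assert(n < 0 && n >= -((int64_t)1 << (b - 1)));
    uint32_t mask = ((uint32_t)1 << b) - 1u;   /* b low-order 1 bits */
    uint32_t a = (uint32_t)(-(int64_t)n);      /* a = |n| */
    return (~a + 1u) & mask;
}
```

For the example in the text, both twos_pattern(-4, 4) and complement_add_one(-4, 4) yield 1100 (hex C).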


Let’s see how two’s complement works in practice. There are 4 possible addition cases: 


(1) Adding two positive numbers: so long as the result doesn’t overflow, we simply add normally (in 
binary). 


(2) Adding two negative numbers: Recall that when adding unsigned integers, if we overflow our 4
bits, the “carries” out of the MSB are simply discarded. This means that the result of adding a + c is actually
(a + c) mod 16. Now, let n and m be negative numbers in two’s complement, so their bit patterns are 16 + n
and 16 + m. If we add their bit patterns as unsigned integers, we get


    (16 + n) + (16 + m) = [32 + (n + m)] mod 16 = 16 + (n + m),      n + m < 0,


which is the two’s complement representation of (n + m) < 0.


    E.g.,     −2        1110         16 + (−2)
            + −3      + 1101       + 16 + (−3)
              −5        1011         16 + (−5)


So with two’s complement, adding negative numbers uses the same algorithm as adding unsigned 
integers! That’s why almost all computers use two’s complement. 


(3) Adding a negative and a positive number, with positive result: 


    (16 + n) + a = [16 + (n + a)] mod 16 = n + a,      n + a > 0

    E.g.,     −2        1110         16 + (−2)
            +  5      + 0101       +  5
               3        0011          3


(4) Adding a negative and a positive number, with negative result: 


    (16 + n) + a = 16 + (n + a),      n + a < 0

    E.g.,     −6        1010         16 + (−6)
            +  3      + 0011       +  3
              −3        1101         16 + (−3)


The computer hardware need not know which numbers are signed, and which are unsigned: it adds the same 
way no matter what. It works the same with subtraction: subtracting two’s complement numbers is the same 
as subtracting unsigned numbers. 
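

The three worked examples above can be checked with a small sketch in C. It models a 4-bit adder that knows
nothing about signs; the names add4() and as_signed4() are ours, for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 4-bit "hardware" adder: adds two 4-bit patterns as unsigned
   numbers and discards any carry out of the MSB, i.e. it works mod 16. */
uint8_t add4(uint8_t x, uint8_t y)
{
    assert(x < 16 && y < 16);
    return (uint8_t)((x + y) & 0xF);
}

/* Interpret a 4-bit pattern as a two's complement value in -8..+7. */
int as_signed4(uint8_t bits)
{
    assert(bits < 16);
    return (bits & 0x8) ? (int)bits - 16 : (int)bits;
}
```

The same add4() call serves both interpretations: add4(0xE, 0xD) returns 0xB, which is 27 mod 16 = 11 when read
as unsigned, and (−2) + (−3) = −5 when read through as_signed4().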


It even works multiplying to the same word size: 


    −+:   (16 + n)a = [16a + na] mod 16 = 16 + na,                        n < 0, a > 0, na < 0
    −−:   (16 + n)(16 + m) = [256 + 16(n + m) + nm] mod 16 = nm,          n < 0, m < 0, nm > 0


In reality, word sizes are usually 32 (or maybe 16) bits. Then in general, we store b-bit negative numbers
(n < 0) as 2^b + n. E.g., for 16 bits, (n < 0) → 65536 + n.


Two’s complement notation explains why the range of signed integers is lopsided: from −2^(b−1) to +(2^(b−1) − 1).


Hexadecimal 


Hexadecimal notation (base-16) is used as a shorthand for writing binary. Each group of 4 bits is written 
as a single hexadecimal digit: 


    binary   hex          binary   hex

    0000      0           1000      8
    0001      1           1001      9
    0010      2           1010      A
    0011      3           1011      B
    0100      4           1100      C
    0101      5           1101      D
    0110      6           1110      E
    0111      7           1111      F


Any binary bit pattern can be written in hexadecimal (or “hex”):

    0110 1100 1010 0011 = 6CA3 .


Hex letters can be written as either upper or lower case, though upper case is a little more formal, and (we 
think) easier to read. 


For unsigned integers, hexadecimal isn’t just shorthand; it is a true base-16 number: 


    1010 0011 = 2^7 + 2^5 + 2^1 + 2^0 = 163 (base 10),      so      A3 = 10×16^1 + 3×16^0 = 163 (base 10).


Two’s complement negative numbers, floating point, and many other bit patterns are often written in hex. 
Two’s complement looks a little funny: 


    1111 0011 = 243 − 2^8 = 243 − 256 = −13 (base 10),      so      F3 = 243 − 16^2 = −13 (base 10).


In C, hexadecimal numbers start with a “0x” prefix, e.g. 0xA3. In Fortran code, hex uses the Z or X
notation, e.g. z’A3’ or x’A3’. However, for input/output, the FORMAT edit descriptor uses only Z (because
X is already used for spaces).
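

As a small illustration of the hex examples above, here is a sketch in C. The helper names signed8() and
hex16() are ours, not part of any library; the signed interpretation is written out arithmetically rather than
relying on a cast, because converting an out-of-range value to a signed type is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: read an 8-bit pattern as a two's complement value. */
int signed8(uint8_t bits)
{
    return bits >= 128 ? (int)bits - 256 : (int)bits;
}

/* Hypothetical helper: write a 16-bit pattern as 4 upper-case hex digits.
   buf must have room for the 4 digits plus the terminating '\0'. */
void hex16(uint16_t bits, char buf[5])
{
    assert(buf != NULL);
    snprintf(buf, 5, "%04X", (unsigned)bits);
}
```

With these, the literal 0xA3 equals 163, signed8(0xF3) gives −13, and hex16() turns the pattern
0110 1100 1010 0011 back into the text “6CA3”.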


In the old days, there was such a thing as “octal,” which is groups of 3 bits, or base 8. This is clumsy 
because computer word lengths are usually divisible by 4, but not 3. Octal is obsolete. 


Digital Floating Point 


Computer floating point is essentially scientific notation, but in binary. Briefly, a binary floating point 
number is represented by 3 fields: S, E, and T: a sign bit, an exponent, and a significand (aka mantissa), 
Figure 14.1. The value v of the number is [IEEE-754-2008 p9]: 


    v = (−1)^S × 2^exponent × significand .

For example, v = −(2^10) × 1.625. Details later.


How Far Can I Go? 


What is the range, in decimal, of numbers that can be represented by the IEEE floating point formats? 
The answer is dominated by the number of bits in the binary exponent. This table shows it: 


Range and Precision of Some Floating Point Formats 


                       Significant    Smallest Normal                               Decimal
    Format             Bits           Magnitude             Largest Number          Digits

    IEEE binary32       24            1.175... × 10^−38     3.402... × 10^+38       6-9
    IEEE binary64       53            2.225... × 10^−308    1.797... × 10^+308      15-17
    x86 long double     64            3.362... × 10^−4932   1.189... × 10^+4932     18-21
    IEEE binary128     113            3.362... × 10^−4932   1.189... × 10^+4932     33-36


How Many Digits Do I Get, 6 or 9?


How many decimal digits of accuracy do I get with a binary floating point number? You often see a 
range: 6 to 9 digits. Huh? 


Wobble, but don’t fall down: The idea of “number of digits of accuracy” is somewhat flawed. Six
digits of accuracy near 100,000 is ~10 times worse than 6 digits of accuracy near 999,999. The smallest
increment is 1 in the least-significant digit (1 ULP = unit in the last place). One in 100,000 is an accuracy of
10^−5; 1 in 999,999 is essentially 10^−6, or 10 times more accurate.


Aside: The wobble of a floating point number is the ratio of the lowest accuracy to the highest accuracy for a
fixed number of digits in its representation (e.g., digits in base 2, 10, or 16). It is always equal to the base in which 
the floating point number is expressed, which is 10 in this example. The wobble of binary floating point is 2. The 
wobble of hexadecimal floating point (mostly obsolete now) is 16. 


We assume IEEE-754 compliant numbers (see later section). To ensure, say, 6 decimal digits of
accuracy, the worst-case binary accuracy must exceed the best-case decimal accuracy. For IEEE binary32,
there are 23 fraction bits (and one implied-1 bit), so the worst-case accuracy is 2^−23 = 1.2 × 10^−7. The best
6-digit accuracy is 10^−6; the best 7-digit accuracy is 10^−7. Thus we see that single precision guarantees 6
decimal digits, but almost gets 7, i.e. most of the time, it actually achieves 7 digits. Though again, “significant
digits” is a flawed concept; “relative error” is better.


The table in the next section summarizes precision in 4 common floating point formats. 


How many digits do I need?


Often, we need to convert a binary number to decimal, write it to a file, and later read it back in,
converting it back to binary. An important question is, how many decimal digits do we need to write to
ensure that we get back exactly the same binary floating point number we started with? In other words, how
many binary digits do I get with a given number of decimal digits? (This is essentially the reverse of the
preceding section.) We choose our number of decimal digits to ensure full binary accuracy (assuming our
conversion software is good, which is not always the case).


Our worst-case decimal accuracy has to exceed our best-case binary accuracy. For IEEE binary32
(single precision), the best accuracy is 2^−24 = 6.0 × 10^−8. For 9 decimal digits, the worst-case accuracy is
10^−8, so we need 9 decimal digits to fully represent binary32. Below is a table of precisions for 4 common
formats; note that all but the x86 long double format use an implied leading ‘1’ bit, so the precision is one
more than the number of stored mantissa bits.


                       Significant    Minimum decimal digits         Decimal digits for exact         Decimal
    Format             bits           accuracy                       replication                      digits range

    IEEE binary32       24            2^−23 = 1.2 × 10^−7 ⇒ 6        2^−24 = 6.0 × 10^−8 ⇒ 9          6-9
    IEEE binary64       53            2^−52 = 2.2 × 10^−16 ⇒ 15      2^−53 = 1.1 × 10^−16 ⇒ 17        15-17
    x86 long double     64            2^−63 = 1.1 × 10^−19 ⇒ 18      2^−64 = 5.4 × 10^−20 ⇒ 21        18-21
    IEEE binary128     113            2^−112 = 1.9 × 10^−34 ⇒ 33     2^−113 = 9.6 × 10^−35 ⇒ 36       33-36
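

The two digit counts in each row follow from the two inequalities described above: d decimal digits are
guaranteed if 10^−d ≥ 2^−(p−1), and d digits suffice for exact replication if 10^−(d−1) ≤ 2^−p, where p is
the number of significant bits. The sketch below evaluates both in integer arithmetic. The function names are
ours, and the constant 30103/100000 is our approximation to log10(2), which is accurate enough for the
precisions in the table but is an assumption for much larger p.

```c
#include <assert.h>

/* Hypothetical helper: decimal digits guaranteed by p significant bits,
   i.e. floor((p - 1) * log10(2)), with log10(2) approximated as 0.30103. */
int digits_guaranteed(int p)
{
    assert(p >= 2 && p <= 1000);
    return (int)(((long)(p - 1) * 30103L) / 100000L);
}

/* Hypothetical helper: decimal digits needed to reproduce any p-bit value
   exactly, i.e. 1 + ceil(p * log10(2)), with the same approximation. */
int digits_for_round_trip(int p)
{
    assert(p >= 2 && p <= 1000);
    return 1 + (int)(((long)p * 30103L + 99999L) / 100000L);
}
```

For p = 24, 53, 64, and 113 these reproduce the ranges 6-9, 15-17, 18-21, and 33-36.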


These numbers of digits agree exactly with the quoted ranges in the “IEEE Floating Point” section below,
and the ULP table in the underflow section. In C, then, to ensure exact binary accuracy when writing, and
then reading, in decimal, for binary64, use:

    sprintf(dec, "%.17g", x);
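

A minimal sketch of the full write-then-read cycle follows. It assumes that ‘double’ is IEEE binary64 and that
the C library’s printf and strtod conversions are correctly rounded; the function name round_trips() is ours.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write x with the given number of significant decimal
   digits, read the text back, and report whether we recover the same double. */
int round_trips(double x, int digits)
{
    char dec[64];
    assert(digits >= 1 && digits <= 40);
    snprintf(dec, sizeof dec, "%.*g", digits, x);
    return strtod(dec, NULL) == x;
}
```

Here the exact comparison is intended: we are asking whether the bit pattern survived, not whether two
computations agree. With 17 digits, every finite double should come back unchanged; with 16 digits, some
values (such as the double nearest the square root of 2) do not.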


Warning: it is wholly inadequate to test “uniformly” distributed real numbers on any interval, because that 
does not exercise the full range of exponents in floating point representations. It is sufficient to test uniformly 
distributed bit patterns. 

Emulate vs. Simulate 


There is substantial disagreement on the difference between emulation and simulation,
and there is some genuine gray area between them. However, we think the best definitions are: 


simulate    Use modeling to compute aspects of a system, without actually producing any of the
            physical interfaces or interactions of a real system. It’s all done in the computer.

emulate     Mimic at least some of the physical interfaces or interactions of a system; this may include
            simulating some aspects of the system, but usually includes some hardware beyond a
            general purpose computer.


Emulators may sometimes run at full speed, but often must run at reduced speed due to physical limitations. 
Simulators have no timing constraints, because they aren’t physically “doing” anything; they usually run 
orders of magnitude slower than the real system. 


We illustrate the gray area with an example: consider a touchpad. It’s a simple device, with a touch 
screen, and software to interact with a user. Suppose we write a program for a computer with a touchscreen 
that performs the functions of some touchpad program. Is the computer program a simulator, or an emulator? 
Since it reproduces the physical interactions of the user touch interface, it fits the definition of an emulator. 
On the other hand, it was all done on the computer, with no additional hardware, so it feels like a simulator. 


Many computing systems do not have floating point hardware. The floating point operations are 
implemented in a software library. This is neither simulation nor emulation; it’s just an implementation. 
However, we have worked on systems with a floating point architecture (instruction set), but with 
inexpensive versions of hardware that do not have built-in floating point. Nonetheless, they run code with 
floating point instructions by trapping on them, invoking software routines to perform the floating point 
operation, and returning to the mainline execution. The mainline code does not “know” whether the floating 
point is done in software or hardware, nor should it. Such trapping is usually called floating point emulation,
because the software is actually executing the floating point instructions, albeit at slower speeds. The 
emulator is providing the instruction set interface to the application code. 


Programming Guidance for Floating Point 


Even in IEEE compliant systems, the same expression can evaluate to different results in different
contexts or locations in the code. Most languages do not demand that two instances of the same expression
always evaluate to the same result, and in many real cases, they do not. A trivial example is:

    double x = 1./3.;
    if(x == 1./3.) ...              // ill-posed


Is the above comparison equal or not equal? You can’t know. All such code is ill-posed.


It is fundamental to floating point that the unavoidable 
approximations in computations mostly prevent exact comparisons. 


The reason the above code is ill-posed is that many architectures (e.g., x86) keep intermediate results in
registers of higher precision than the constituent values. ‘x’ above is a binary64, and will be rounded to that
precision. In the ‘if’ statement, 1./3. will likely be held in a register, which on the x86 is 80 bits. A 64-bit
version of 1/3 is not the same as an 80-bit version of 1/3. Therefore, it is well-known that comparisons should
generally test for equality within some allowable range (which depends on the application):

    if(fabs(x - 1./3.) < epsilon) ...       // well-posed


Sometimes you must check for some relative error, rather than an absolute error. For example, to compare
‘x’ and ‘y’ (assuming x > 0):

    if(fabs((x - y)/x) < epsilon) ...       // well-posed


It seems clunky, but it is widely used. It gets clunkier if there’s a chance that x = 0. However, you can write 
a function to centralize it all. 
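

One possible shape for such a function is sketched below. The name nearly_equal() and its two tolerance
parameters are our assumptions, not a standard API: rel_eps bounds the relative error, and abs_eps is a
fallback absolute tolerance for values at or near zero. Both must be chosen for the application.

```c
#include <assert.h>

/* Hypothetical helper: nonzero if x and y agree to within a relative
   tolerance rel_eps, or to within an absolute tolerance abs_eps (the
   fallback that handles x = 0 or y = 0).  Returns 0 if either is NaN. */
int nearly_equal(double x, double y, double rel_eps, double abs_eps)
{
    assert(rel_eps >= 0. && abs_eps >= 0.);
    double diff  = x > y ? x - y : y - x;       /* |x - y| */
    double ax    = x < 0. ? -x : x;             /* |x| */
    double ay    = y < 0. ? -y : y;             /* |y| */
    double scale = ax > ay ? ax : ay;           /* larger magnitude */
    return diff <= abs_eps || diff <= rel_eps * scale;
}
```

Scaling by the larger magnitude makes the test symmetric in x and y, and avoids the division, so there is
nothing special to do when one argument is zero.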


One of the few times you can exactly compare floating point numbers is when they have been directly
assigned integer values: they can be counted on to be exact, because there is no possibility of rounding:

    float x = 27.;
    if(x == 27.) ...                // well-posed
    if((int)x == 27) ...            // well-posed


(Of course, ‘x’ must be big enough to hold the integer.) However, any intermediate fractional computations
break this assurance:

    float x = 27./5*5;

results in an unpredictable value of ‘x’. All we can rely on is that ‘x’ is close to 27.


Some other considerations are in [Unk 1999].


IEEE Floating Point 


This section describes IEEE-754-2008, which updated the 1985 version. There is a 2019 update which 
makes only minor changes. Much of the information here comes directly from the standard. Much of the 
tabular information comes from Sun’s Numerical Computation Guide, 2003 [Sun 2003]. Floating point 
computation is a large and intricate field, and we strongly recommend consulting more detailed references 
(including the sections above) before making assumptions about how it works. 


Note: the standard uses the word “exception” to mean something different than the definition used by 
many computer languages [IEEE-754 p3b]. For example, in C/C++, IEEE “exceptions” are turned into
“signals,” and do not trigger C++ “exceptions.” 


Goals 


IEEE-754 lists goals in both the introduction [p. iv], and the overview [p1]. Essentially, they want to 
make it easier to write portable numeric code, and have such code report errors consistently across all 
platforms. And they have succeeded well in both of these. 


They also hope (in both the introduction and overview) that computation results will be identical across 
platforms. This is probably a pipe dream. There are many factors working against this goal, most of which 
are outside our scope here. Essentially though, the cost to achieve this goal is enormous, and most users are 
simply unwilling to pay such a price. However, the standard recommends languages define a “literal 
meaning” that, in principle, could allow identical results on disparate platforms [IEEE-754 p50], even at the
cost of runtime efficiency. For many platforms, this recommendation is not feasible. 


What Is IEEE Arithmetic? 


In brief, IEEE 754 specifies exactly how floating point operations are to occur, and to what precision. 
Though it does not specify how the floating point numbers are stored in memory, it does specify bit-level 
interchange formats. Each computer makes its own choice for how to store floating point numbers. We give 
some popular formats later. 


The original specification, IEEE-754-1985, defined binary floating point. IEEE-754-2008 added
decimal formats. We discuss here only binary floating point. 


IEEE 754-2008 specifies (among other things) a binary floating point standard with: 


• Three basic binary floating-point formats: binary32 (32-bits), binary64, and binary128 [IEEE-754
  p6]. (The 32 and 64-bit formats were called “single” and “double” in the 1985 version of IEEE-754.)

• Every representable number has a unique representation [IEEE-754 p9t].

• Recommendations for extending these formats to other lengths [IEEE-754 p6].

• Precise accuracy requirements for mathematical operations, including rounding. Though individual
  operations require bit-level agreement between implementations [IEEE-754 p. iv, top], other factors
  usually prevent such agreement on the whole.

• Four rounding modes for binary [IEEE-754 sec. 4.3.3 p16b], only one of which is widely used:
  “roundTiesToEven.”

• Precise accuracy requirements for conversion between decimal and binary, including guarantees of
  exact recovery from binary->decimal->binary conversions [IEEE-754 p30].

• Representations for negative zero, ±infinity, and two forms of NaN (not a number).

• Five kinds of exceptions: invalid operation, division by zero, overflow, underflow, and inexact.


The standard does not provide:

• Function libraries (sin, cos, etc.), which often do not have bit-level accuracy, and therefore may
  differ between platforms.

• Precise handling of intermediate results: in a multi-step calculation, rounding may occur at
  different places in a full-language programming environment, even between conforming IEEE
  platforms. Therefore, “conforming” platforms may yield different results.


Note that most floating point programs work fine on multiple platforms, and even on non-compliant 
floating-point implementations, despite tiny differences in their results. Numerical computation existed long 
before the first standard in 1985, and was eminently useful. 


High level languages have different names for floating point data types, which usually correspond to the 
IEEE formats as shown here [Sun 2003]: 


IEEE Formats and Example Language Types

    IEEE format                   C, C++           Fortran

    binary32                      float            REAL or REAL*4
    binary64                      double           DOUBLE PRECISION or REAL*8
    double extended (80-bit)*     long double
    binary128                                      REAL*16 [e.g., SPARC]. Note that in many
                                                   implementations, REAL*16 is different than ‘long double’

* 80-bit double-extended is not part of the standard, and does not comply with the interchange format for
“extended” precisions, due to the explicit ‘j’ field. However, it is a widely used part of the x86 architecture.


[IEEE-754 2008 p13] defines a binary16 format, but it is rarely used. 


Binary Numeric Formats 


A binary floating point number is represented by 3 fields: S, E, and T: a sign bit, an exponent, and a 
significand (aka mantissa) (Figure 14.1). The value v of the number is: 


    v = (−1)^S × 2^exponent × significand ,      e.g. −(2^10) × 1.625 [IEEE-754-2008 p9].

p is defined as the number of bits in the significand (the precision), e.g. for binary32, p = 24 [IEEE-754 p8].


binary32 (6-9 decimal digits)

    width (bits):      1        8             23
    field:             S        E             T
    bit numbers:      31     30 ... 23     22 ... 0 (LSB)


binary64 (15-17 decimal digits)

    width (bits):      1       11             52
    field:             S        E             T
    bit numbers:      63     62 ... 52     51 ... 0 (LSB)


Double-Extended (x86 long double) (18-21 decimal digits)

    width (bits):      1       15          1        63
    field:             S        E          J        T
    bit numbers:      79     78 ... 64    63     62 ... 0 (LSB)


binary128 (33-36 decimal digits)

    width (bits):      1       15             112
    field:             S        E             T
    bit numbers:     127    126 ... 112    111 ... 0 (LSB)


Figure 14.1 Common floating point formats (not to scale), with S, E, T fields named as in [IEEE-754 p9].
(However, IEEE-754 numbers the bits from 0 on the left.)


As in the standard, we sometimes refer to the bit-fields E and T by their integer interpretation. Then in 
binary32, E = 128 corresponds to an exponent of 1 = 128 — bias. 


When a Bias Is a Good Thing 


IEEE floating point uses biased exponents, where the actual exponent is the unsigned value of the ‘E’ 
field minus a constant, called a bias: 


exponent = E — bias where E=biased exponent . 


The bias makes the E field an unsigned integer, and the smallest numbers have the smallest E field (as well 
as the smallest exponent). The bias provides that (1) floating point numbers sort in the same order as if their 
bit patterns were integers; and (2) true floating point zero is naturally represented by an all-zero bit pattern. 
These might seem insignificant, but they are quite useful, and so biased exponents are nearly universal. 


In IEEE floating point, two values of E are special: 0, and all 1’s. Briefly: E = 0 is used for subnormal
numbers (described later), and is the same exponent value as E = 1 (exponent = 1 − bias). For example, in
binary32, E is an 8-bit field, and the bias is 127. The maximum E for a finite number is then 254 (0xFE),
which is an exponent of +127, and the minimum exponent is −126. When E is all 1’s (255 in binary32), the
number is either an infinity or NaN (not a number). Full details to come.


Interchange Formats and Storage Formats 


IEEE-754-2008 defines an “interchange format” to be used for exchanging data between systems. 
However, within a single system, each implementation defines its own storage format (how a number is 
stored in memory). Oddly, [IEEE-754 Table 3.5 p13] defines the “storage width” in bits, and the number of 
bits in each field, but this certainly describes the interchange format. 


Note that two’s complement is not used for negative numbers.


Each computer defines its own storage formats, though they are obviously all related.


IEEE binary32 (Single) Format 


The table below shows the constituent fields, S, E, and T, and the values represented [Sun 2003]: 


    1 ≤ E ≤ 254                       (−1)^S × 2^(E−127) × 1.T      (normal numbers)
    E = 0; T ≠ 0                      (−1)^S × 2^(−126) × 0.T       (subnormal numbers)
    E = 0; T = 0                      (−1)^S × 0.0                  (signed zero)
    E = 255; T = 0                    (−1)^S × ∞                    (±infinity)
    S = either; E = 255; T ≠ 0        NaN                           (Not-a-Number)


On x86, which is little-endian, the least significant byte is stored first. 


Note that when 1 ≤ E ≤ 254, the significand is found by prepending “1.” to the fraction in T:

    v = (−1)^S × 2^(E − bias) × 1.T .


The “1.” is implied, and is called the “implicit bit.” The result is that 1 ≤ significand < 2. Thus the 23-bit T
field yields 24 bits of precision.
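

To make the field layout concrete, here is a sketch in C that pulls S, E, and T out of a ‘float’. It assumes
that ‘float’ is IEEE binary32 (true on most current platforms, but not guaranteed by the C standard). The
struct and function names are ours, for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The three binary32 fields, as unsigned integers. */
typedef struct {
    unsigned S;     /* sign bit, 0 or 1 */
    unsigned E;     /* biased exponent, 0..255 */
    uint32_t T;     /* 23-bit trailing significand (fraction) field */
} Binary32Fields;

/* Hypothetical helper: split a float into its S, E, and T fields.
   memcpy() copies the bit pattern without violating C aliasing rules. */
Binary32Fields split_binary32(float x)
{
    uint32_t bits;
    Binary32Fields f;
    assert(sizeof bits == sizeof x);        /* assumes a 32-bit float */
    memcpy(&bits, &x, sizeof bits);
    f.S = bits >> 31;
    f.E = (bits >> 23) & 0xFFu;
    f.T = bits & 0x7FFFFFu;
    return f;
}
```

For example, −1664 = −(2^10) × 1.625 splits into S = 1, E = 10 + 127 = 137, and T = 0x500000 (the fraction
.101 in binary, followed by zeros).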


Below, the hexadecimal numbers follow Figure 14.1, with the sign and exponent on the left. The decimal 
values are rounded [Sun 2003]. 


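Important Bit Patterns in IEEE Binary32 Format

    Common name                 Bit pattern (hex)     Approximate value

    +0                          0000 0000             0.0
    −0                          8000 0000             −0.0
    1                           3F80 0000             1.0
    2                           4000 0000             2.0
    max normal                  7F7F FFFF             3.402 823 47 e+38
    min positive normal         0080 0000             1.175 494 35 e−38
    max subnormal               007F FFFF             1.175 494 21 e−38
    min positive subnormal      0000 0001             1.401 298 46 e−45
    +∞                          7F80 0000             +∞
    −∞                          FF80 0000             −∞
    NaN*                        7FC0 0000             NaN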


* NaN requires only that the most significant bit of T = 1, so the ‘C’ hex digit can be C - F. 


binary64 (Double) Format 


The IEEE binary64 format is an obvious extension of the binary32 format: a 52-bit trailer, T; an 11-bit 
biased exponent, E; and a 1-bit sign, S. The 52-bit fraction combined with the implicit leading 1-bit provides 
53 bits of precision. The table below shows the values represented [Sun 2003]: 


    1 ≤ E ≤ 2046                      (−1)^S × 2^(E−1023) × 1.T     (normal numbers)
    E = 0; T ≠ 0                      (−1)^S × 2^(−1022) × 0.T      (subnormal numbers)
    E = 0; T = 0                      (−1)^S × 0.0                  (signed zero)
    E = 2047; T = 0                   (−1)^S × ∞                    (±infinity)
    S = either; E = 2047; T ≠ 0       NaN                           (Not-a-Number)


On SPARC, which is big-endian, the most significant 32 bits are stored as the first word. x86 is little- 
endian, so the least significant byte is stored first. 


Below, the hexadecimal numbers follow Figure 14.1, with the sign and exponent on the left. The decimal 
values are rounded [Sun 2003]. 


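Important Bit Patterns in IEEE binary64 Format

    Common name                 Bit pattern (hex)       Approximate value

    +0                          00000000 00000000       0.0
    −0                          80000000 00000000       −0.0
    1                           3FF00000 00000000       1.0
    2                           40000000 00000000       2.0
    max normal                  7FEFFFFF FFFFFFFF       1.797 693 134 862 3157 e+308
    min positive normal         00100000 00000000       2.225 073 858 507 2014 e−308
    max subnormal               000FFFFF FFFFFFFF       2.225 073 858 507 2009 e−308
    min positive subnormal      00000000 00000001       4.940 656 458 412 4654 e−324
    +∞                          7FF00000 00000000       +∞
    −∞                          FFF00000 00000000       −∞
    NaN*                        7FF80000 00000000       NaN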


* NaN requires only that the most significant bit of T = 1, so the ‘8’ hex digit can be 8 - F. 


binary128 (Double-Extended) Format 


SPARC quadruple-precision conforms to IEEE binary128 format. Analogous to binary64, the first 32- 
bit word holds the sign, exponent, and most-significant bits of T. This table shows the values represented 
[Sun 2003]: 


    1 ≤ E ≤ 32766                     (−1)^S × 2^(E−16383) × 1.T    (normal numbers)
    E = 0; T ≠ 0                      (−1)^S × 2^(−16382) × 0.T     (subnormal numbers)
    E = 0; T = 0                      (−1)^S × 0.0                  (signed zero)
    E = 32767; T = 0                  (−1)^S × ∞                    (±infinity)
    S = either; E = 32767; T ≠ 0      NaN                           (Not-a-Number)


On SPARC, which is big-endian, the most significant 32 bits are stored as the first word. x86 is
little-endian, so the least significant byte is stored first.


Below, the hexadecimal numbers follow Figure 14.1, with the sign and exponent on the left. The decimal 
values are rounded [Sun 2003]: 


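Important Bit Patterns in IEEE binary128 Format

    Common name               Bit pattern (hex)                          Approximate value

    +0                        0000 0000 00000000 00000000 00000000       0.0
    −0                        8000 0000 00000000 00000000 00000000       −0.0
    1                         3FFF 0000 00000000 00000000 00000000       1.0
    2                         4000 0000 00000000 00000000 00000000       2.0
    max normal                7FFE FFFF FFFFFFFF FFFFFFFF FFFFFFFF       1.189 731 495 357 231 765
                                                                         085 759 326 628 0070 e+4932
    min positive normal       0001 0000 00000000 00000000 00000000       3.362 103 143 112 093 506
                                                                         262 677 817 321 7526 e−4932
    max subnormal             0000 FFFF FFFFFFFF FFFFFFFF FFFFFFFF       3.362 103 143 112 093 506... e−4932
    min positive subnormal    0000 0000 00000000 00000000 00000001       6.475 175 119 438 025... e−4966
    +∞                        7FFF 0000 00000000 00000000 00000000       +∞
    −∞                        FFFF 0000 00000000 00000000 00000000       −∞
    NaN*                      7FFF 8000 00000000 00000000 00000000       NaN

* NaN requires only that the most significant bit of T = 1, so the ‘8’ hex digit can be 8 - F.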


x86 Double-Extended Format 


The x86 double-extended format is not part of the standard. The important difference in the x86 double- 
extended (aka long-double) format is the lack of an implicit leading 1-bit in the significand. Instead, the 1- 
bit is explicit in the j field, and always present in normalized numbers. The binary (radix) point is between j 
and T, which together compose the significand. This clearly violates the spirit of the IEEE standard. 
However, big companies carry a lot of clout with standards bodies, so Intel claims this double-extended 
format conforms to the IEEE definition of double-extended formats, because IEEE 754 does not specify how 
(or if) the leading 1-bit is stored. Thus, x86 long-double consists of four fields (rather than three): a 63-bit 
fraction, T; a 1-bit explicit leading significand bit, j; a 15-bit biased exponent, E; and a 1-bit sign, S (note the
additional j field as the explicit leading bit). 


Nonetheless, for all normalized numbers (1 ≤ E ≤ 32766), j must be 1. Operating on a value with
1 ≤ E ≤ 32766 and j = 0 triggers an invalid operation. Only subnormal numbers can have j = 0.


In x86 architectures, a double-extended number occupies 10 bytes, in little-endian order. However, for 
parameter and result passing, the UNIX System V Application Binary Interface Intel 386 Processor 
Supplement (Intel ABI) specifies 12-bytes, with the most significant two bytes being unused (see below). 
(This is likely due to the run-time speed benefits of 32-bit aligned data in memory.) 


    width (bits):       16          1       15          1        63
    field:            unused        S        E          j        T
    bit numbers:     95 ... 80     79     78 ... 64    63     62 ... 0 (LSB)


Figure 14.2 Unix ABI format for 80-bit double-extended (long double) includes two pad bytes.


The x86 architecture also specifies two kinds of NaN: a quiet NaN which always propagates through 
calculations without raising a hardware signal, and a signaling NaN, which raises a hardware signal if it is 
used in any operation. 


The table below shows the four constituent fields and the values they represent; x = don’t care [Sun 2003].


    j = 1; 1 ≤ E ≤ 32766                              (−1)^S × 2^(E−16383) × 1.T    (normal numbers)
    j = 0; E = 0; T ≠ 0                               (−1)^S × 2^(−16382) × 0.T     (subnormal numbers)
    j = 0; E = 0; T = 0                               (−1)^S × 0.0                  (signed zero)
    j = 1; S = 0/1; E = 32767; T = 0                  ±∞                            (±infinity)
    j = 1; S = x; E = 32767; T = .1xxx...xx           QNaN                          (quiet NaNs)
    j = 1; S = x; E = 32767; T = .0xxx...xx ≠ 0       SNaN                          (signaling NaNs)


Below are some important bit patterns in the double-extended format [Sun 2003]. 


Important Bit Patterns in Double-Extended (x86) Format and their Values

    Common name                 Bit pattern (hex)             Approximate value

    +0                          0000 00000000 00000000        0.0
    −0                          8000 00000000 00000000        −0.0
    1                           3FFF 80000000 00000000        1.0
    2                           4000 80000000 00000000        2.0
    max normal                  7FFE FFFFFFFF FFFFFFFF        1.189 731 495 357 231 765 05 e+4932
    min positive normal         0001 80000000 00000000        3.362 103 143 112 093 506 26 e−4932
    max subnormal               0000 7FFFFFFF FFFFFFFF        3.362 103 143 112 093 506 08 e−4932
    min positive subnormal      0000 00000000 00000001        3.645 199 531 882 474 602 53 e−4951
    +∞                          7FFF 80000000 00000000        +∞
    −∞                          FFFF 80000000 00000000        −∞
    quiet NaN                   7FFF C0000000 00000000        QNaN
    signaling NaN               7FFF 80000000 00000001        SNaN


All NaNs have E = Emax. A quiet NaN has the two most significant significand bits (j, followed by the first
bit of T) = 11, so the leading hex digit may be C-F. A signaling NaN sets those two bits to 10, so the leading
hex digit may be 8-B.


Precision in Binary and Decimal


See the earlier section on How Many Digits Do I Get? for introductory information. Because decimal
numbers are different than binary numbers, estimating the number of significant decimal digits corresponding
to b significant binary bits requires some definition. Figure 14.3 illustrates that the numbers representable in
decimal are different than those representable in binary. Although all binary numbers can be represented
exactly in decimal, this usually requires unreasonably many digits. Therefore, exact conversions are generally
not practical.

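[Figure: number lines comparing the numbers representable in decimal (spanning powers of 10) with those
representable in binary (spanning powers of 2).]


Figure 14.3 Comparison of numbers representable in decimal and binary.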


However, as noted earlier, conversions from binary->decimal->binary can be done which preserve
exactly the original binary value, and this is required by [IEEE-754 sec. 5.12]. For example, in a typical
system, converting sqrt(2) to 19 decimal digits yields:

    sqrt(2) = 1.414 213 562 373 095 146


The following calculations show that in this case, 17 decimal digits reproduce the binary value exactly, and
16 digits do not:

    sqrt(2) − 1.414 213 562 373 095 1 = 0
    sqrt(2) − 1.414 213 562 373 095   = 2.220 446 049 250 313 081 e−16


However, for some numbers, a rounded 16 digits is adequate:

    sqrt(5) = 2.236 067 977 499 789 805
    sqrt(5) − 2.236 067 977 499 790 = 0


We showed earlier that (for good conversion software), 17 digits is always enough to represent binary64
numbers.


The Big ULP 


Before we discuss underflow, we need to understand concepts and terminology related to precision. 
Most floating point calculations have to be rounded. Just as in ordinary scientific notation, the round off 
error depends on the magnitude of the number being rounded. Using binary32 as an example, consider the 
number: 


    1.000 0000 0000 0000 0000 0001 × 2^0 .


The smallest increment we can make to such a number is to add one unit in the last place (ULP). In this
case, that’s 2^−23. Ideally, any roundoff will be less than this; in fact, roundoff should be ≤ (1/2)ULP =
(1/2)2^−23. A number with a bigger exponent will have a bigger ULP:


    1.000 0000 0000 0000 0000 0001 × 2^7      ⇒      ULP = 2^−23 × 2^7 = 2^−16 .


Since the ULP depends on the number, we can write ULP as a function of that number, ulp(x). This 
function is often available to programmers, such as in the Boost C++ libraries. Here are ulp(1.0) for some 
common precisions [Sun 2003]: 


ulp(1.0) in Different Precisions

    binary32 (single)           ulp(1.) = 2^−23    ≈ 1.192 093 e−07
    binary64 (double)           ulp(1.) = 2^−52    ≈ 2.220 446 e−16
    double extended (x86)       ulp(1.) = 2^−63    ≈ 1.084 202 e−19
    binary128                   ulp(1.) = 2^−112   ≈ 1.925 930 e−34


Along these lines, there is a standard C and C++ library function nextafter(x, y) that returns the next
floating point number after x in the direction of y (up or down). Here are several examples [Sun 2003]:


Gaps Between Representable binary32 Numbers

    x           nextafter(x, +∞)        gap

    0.0         1.4012985e−45           1.4012985e−45
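

Such gaps can also be computed without the math library. The sketch below steps the bit pattern of a positive,
finite binary32 number up by one, which gives the next representable value because positive floating point
numbers sort in the same order as their bit patterns. It assumes ‘float’ is IEEE binary32; the names
next_up_positive() and ulp_positive() are ours, and in real code the library function nextafterf() is the
more general tool.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the next representable binary32 number above x,
   for positive finite x below FLT_MAX.  Adding 1 to the bit pattern
   carries from T into E automatically at a power of 2. */
float next_up_positive(float x)
{
    uint32_t bits;
    assert(x > 0.0f && x < FLT_MAX);
    memcpy(&bits, &x, sizeof bits);
    bits += 1u;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Hypothetical helper: the gap from x up to its neighbor, i.e. ulp(x). */
float ulp_positive(float x)
{
    return next_up_positive(x) - x;
}
```

For example, ulp_positive(1.0f) is 2^−23 (the same value as FLT_EPSILON), and ulp_positive(128.0f) is 2^−16,
in agreement with the 2^7 example above.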


Round and Round 


Unfortunately, there are several ways to round off numbers, and IEEE-754 specifies four different ways
(for binary). The only reasonable rounding method is the default, called roundTiesToEven. This is
essentially the way scientists learn to round off numbers: fractions > 1/2 get rounded up, < 1/2 get rounded
down, and exactly = 1/2 get rounded to make the preceding digit even. The standard requires exact
conformance to this rounding, so care must be taken because multiplication and addition of 24-bit operands
can produce 47- or 48-bit results. Every bit of the result must be taken into account, if necessary, to properly
round. For example, in binary32, consider these 48-bit multiplication products:


1.000 0000 0000 0000 0000 0000 1000 0000 0000 0000 0000 0001 rounds to 
1.000 0000 0000 0000 0000 0001 


but 


1.000 0000 0000 0000 0000 0000 1000 0000 0000 0000 0000 0000 rounds to 
1.000 0000 0000 0000 0000 0000 


Efficient implementations need not compute all 48 bits of the result to effect proper rounding. They can 
actually keep only 25 bits of result, along with a flag indicating that one or more 1-bits have been discarded 
so far [Knu_vol2 Ex. 5 p227]: 


    1.000 0000 0000 0000 0000 0001 1 + (discarded 1’s flag)

To round:

• if the 25th bit is 0, round down;

• if it is 1 and there are any discarded 1’s, round up;

• if it is 1 and no discarded 1’s, round to make the preceding digit even.
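

The three rules can be sketched in C as follows. For simplicity, the sketch takes the full 48-bit product as
input and derives the 25th bit and the discarded-1’s flag from it; a real implementation would maintain the
flag as it goes, without ever holding all 48 bits. The function name is ours.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round a 48-bit significand (1 integer bit followed
   by 47 fraction bits) to 24 bits using roundTiesToEven.  Only the 25th
   bit and a flag for "any discarded 1's" are consulted. */
uint32_t round_ties_to_even_48_to_24(uint64_t sig48)
{
    assert(sig48 < ((uint64_t)1 << 48));
    uint32_t kept      = (uint32_t)(sig48 >> 24);           /* top 24 bits */
    unsigned bit25     = (unsigned)((sig48 >> 23) & 1u);    /* the 25th bit */
    int      discarded = (sig48 & 0x7FFFFFu) != 0;          /* any 1's below it? */

    if (bit25 && (discarded || (kept & 1u)))
        kept += 1u;     /* round up; on a tie, only if 'kept' is odd */
    return kept;        /* may reach 2^24, which the caller must renormalize */
}
```

The two 48-bit products shown earlier serve as test cases: the first (with a trailing 1) rounds up to
1.000 0000 0000 0000 0000 0001, and the second (an exact tie) rounds to the even value
1.000 0000 0000 0000 0000 0000.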


Why roundTiesToEven? A long calculation accumulates more and more roundoff error as it
progresses. Many operations round, and the error (on average) gets worse. With grade-school-type rounding,
where ties (e.g., 0.5) always round up, the rounding error is biased: it has a non-zero average, because it
slightly favors rounding up. The (average) cumulative rounding error is then proportional to n, the number
of operations.


In contrast, roundTiesToEven is unbiased: half the time ties go up, and half the time they go down.
The average magnitude of the rounding error is proportional to n^(1/2), which is generally much smaller.


roundTiesToEven is also symmetric: round(x) = —round(-x). 


Other rounding modes: The standard suggests that the other rounding modes might be useful for 
diagnosing numerical problems [IEEE-754 p55]. There are some rare calculations, such as interval
computing, that can try to bound the errors on a given calculation. One approach is to run the calculation 
with “roundTowardNegative,” and then run it again with “roundTowardPositive.” Under some reasonable 
conditions, the true answer will be between the two, so they establish error bounds. 
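
As a rough sketch of this idea for a single operation, the standard C++11 header <cfenv> lets a program switch rounding modes at run time. The struct Bounds and the function divide_with_bounds are hypothetical names; the sketch also assumes the platform supports FE_DOWNWARD and FE_UPWARD, and that the compiler honors run-time rounding-mode changes (some compilers need a flag or pragma for this).

```cpp
#include <cassert>
#include <cfenv>

// Bounds and divide_with_bounds are hypothetical names for illustration.
struct Bounds { double lo, hi; };

// Divide num by den twice: once rounding toward negative infinity, once
// toward positive infinity. The exact quotient lies between lo and hi.
Bounds divide_with_bounds(double num, double den)
{
    assert(den != 0.0);
    // volatile discourages the compiler from folding the division at
    // compile time or reusing one result for both rounding modes.
    volatile double n = num, d = den;
    const int saved = std::fegetround();

    std::fesetround(FE_DOWNWARD);       // roundTowardNegative
    volatile double lo = n / d;
    std::fesetround(FE_UPWARD);         // roundTowardPositive
    volatile double hi = n / d;

    std::fesetround(saved);             // restore the caller's rounding mode
    Bounds b;
    b.lo = lo;
    b.hi = hi;
    return b;
}
```

For an inexact quotient such as 1/3, lo and hi differ by one unit in the last place; for an exact quotient such as 1/2, they coincide.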


We have seen several software implementations that provide only the default rounding mode, 
roundTiesToEven. 


Underflow 


Normalized numbers have a minimum nonzero magnitude, given in the tables above for each precision. 
Before the standard, any result smaller than this minimum was often simply replaced by zero (an underflow 
method called “store 0”). This sometimes causes “jerkiness” in sensitive algorithms near zero. Instead, 
IEEE-754 specifies a “smoother” method, that actually represents numbers smaller than the minimum 
normal, called “gradual underflow.” These unnormalized numbers are called subnormal numbers. 


The idea is simple: subnormal numbers simply don’t have an implied ‘1’ as the most significant bit of 
the mantissa. Instead, their value is: 


$v = (-1)^S \times 2^{-\mathrm{bias}+1} \times 0.T$


Subnormal numbers always have the smallest allowed exponent, e.g. −126 in binary32, though it is stored as
E = 0. Thus E = 1 and E = 0 are the same exponent, but the former has an implied 1-bit, and the latter does
not. Therefore, the smallest representable magnitude is, in binary32 format:


$0.000\,0000\,0000\,0000\,0000\,0001_2 \times 2^{-126} = 2^{-23} \times 2^{-126} = 2^{-149} \approx 1.4 \times 10^{-45}$.
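
This value can be checked against the C++ standard library. The sketch below builds 2^−23 × 2^−126 with std::ldexp and compares it with std::numeric_limits<float>::denorm_min(); the function name smallest_subnormal_binary32 is hypothetical, and the sketch assumes float is IEEE-754 binary32 with subnormals enabled (no flush-to-zero).

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Smallest positive binary32 magnitude, built as in the text: 2^-23 * 2^-126.
// smallest_subnormal_binary32 is a hypothetical name for illustration.
float smallest_subnormal_binary32()
{
    const float v = std::ldexp(1.0f, -23 - 126);    // 2^-149
    assert(std::fpclassify(v) == FP_SUBNORMAL);     // below the normal range
    return v;
}
```

The result equals denorm_min() and is smaller than std::numeric_limits<float>::min(), which is the smallest normal number.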


The following table repeats the minimum normal magnitudes given earlier, and includes the next lower
representable magnitude, which is the largest subnormal magnitude [Sun 2003]:


Underflow Thresholds in Each Precision

Format                   Quantity                   Value
binary32 (single)        smallest normal number     1.175 494 35 e−38
                         largest subnormal number   1.175 494 21 e−38
binary64 (double)        smallest normal number     2.225 073 858 507 201 4 e−308
                         largest subnormal number   2.225 073 858 507 200 9 e−308
double extended (x86)    smallest normal number     3.362 103 143 112 093 506 26 e−4932
                         largest subnormal number   3.362 103 143 112 093 505 90 e−4932
binary128                smallest normal number     3.362 103 143 112 093 506 262 677 817 321 752 6 e−4932
                         largest subnormal number   3.362 103 143 112 093 506 262 677 817 321 752 0 e−4932


Subnormal numbers have fewer bits of precision than normal numbers, and the precision decreases as 
the subnormal numbers get smaller. This decreasing precision is called gradual underflow. The inclusion 
of gradual underflow into the standard was controversial at the time, but is now almost universally accepted. 


Gradual underflow means the absolute roundoff error for subnormal numbers is the same as for the 
smallest normal numbers. Of course, the relative roundoff error is worse for subnormal numbers, because 
they are smaller magnitude than normal numbers. 

Gradual underflow has several useful properties. An important one is:

$x \neq y \iff x - y \neq 0$.


This is true mathematically, but with Store 0, x − y might underflow and appear to be 0, when in fact x ≠ y.
With gradual underflow, x ≠ y is always equivalent to x − y ≠ 0. This is because of the following: if x and y
have the same exponent, then x − y is exact, even with subnormal numbers. If x and y have different
exponents, then at least one of them is normalized, and |x − y| cannot be smaller than the smallest subnormal,
and hence cannot be zero.


Note that even with gradual underflow, multiplication can produce surprising results: even if x and y are 
both > 0, x*y can be “zero.” By default, this will signal an underflow arithmetic fault to the program. 
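
Both behaviors are easy to observe. The following minimal sketch assumes float is IEEE-754 binary32 with gradual underflow enabled (no flush-to-zero compiler or hardware mode); all function names are hypothetical.

```cpp
#include <cassert>
#include <limits>

// True if the computed binary32 difference x - y is nonzero.
bool difference_is_nonzero(float x, float y)
{
    volatile float d = x - y;   // volatile: force rounding to binary32
    return d != 0.0f;
}

// True if the computed binary32 product x * y is nonzero.
bool product_is_nonzero(float x, float y)
{
    volatile float p = x * y;
    return p != 0.0f;
}

// Demonstration with the smallest normal number, 2^-126.
void demo_gradual_underflow()
{
    const float tiny = std::numeric_limits<float>::min();
    // 1.5*tiny - tiny = 2^-127 is subnormal, but it is not zero.
    assert(difference_is_nonzero(1.5f * tiny, tiny));
    // tiny*tiny = 2^-252 is far below the smallest subnormal: it becomes 0.
    assert(!product_is_nonzero(tiny, tiny));
}
```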


Most numerical programs work fine with no consideration of underflow; most programs never create or 
use such small numbers. However, when they do, gradual underflow relieves the application designer of a 
lot of work and error analysis related to the old-fashioned Store 0 approach. 

15 Scientific Programming: Discovering Efficiency

Programming for science, especially research, 
is substantially different than for commercial software. 


We focus here on scientific programming, which is neglected in most references. 


I was a commercial software engineer for about 15 years, and have been in research science for about 
15 more (as of 2018). I have been engineering software for 45 years, for both stand-alone computers and 
embedded systems. I have coded extensively in C, C++, Fortran, Pascal, and various Assemblers. I’ve used 
Python a fair bit, Basic, Emacs MLisp, some specialized application-specific array languages (e.g., CALMA 
GPL), and experimented with oddball ones like Lisp, Trac, and machine micro-code. I’ve written Fortran 
compilers, assemblers, link-editors, CPU and physics simulators, done kernel-level OS modifications, and 
low-level device drivers. That said, I’ve been out of mainstream software engineering for 15 years, so I’m 
not fully current on all the newest language features. 


I’ve seen both research code and commercial software in detail, and seen the difference: the tradeoffs 
are different between scientific programming and commercial programming. Much more of scientific 
programming is one-off: never to be used again. Scientific programming is often (but not always) done with 
a smaller group of programmers, often just one. Commercial software typically involves a handful to 
hundreds of engineers working on the same build, or suite. Scientific programs are often much smaller than 
commercial programs. That doesn’t mean research code should be sloppy; it shouldn’t. 


One reason commercial programming dogma fails in scientific programming is simple: overkill. 


While much commercial dogma is actually bad even for commercial software, it is even worse for research
software. Far too many researchers are barraged with extremism: “never use global variables!”, “never use
‘goto’!”, “never use ‘break’ or ‘continue’!”, “never return from the middle of a function!”, etc.


That said, much of commercial software development advice is good for scientific programming, as 
well. This chapter presents our views of good and bad advice. 


There are two very different kinds of “efficiency”: 
(1) “development efficiency:” the time it takes you to get your code working (design, code, debug); and 
(2) the run-time efficiency of the code itself. 


Usually, your development time is by far more important, and the run-time efficiency is not important. 
However, my own dissertation required attention to run-time efficiency. Furthermore, supercomputer centers 
demand that your code meet specific run-time efficiency standards. Therefore, we address both kinds of 
efficiency. 


So sit back, take an alprazolam, and give it a skim. 


Software Development Efficiency 


Some Do’s and Don’t’s 
We expound later on some of the following recommendations: 
• Do design your code up front: before coding, think about where you’re going, and might go.
• Do “modularize” your code: separate it into hierarchical chunks.
• Do get your code working first; optimize it later only where needed.
• Do include lots of comments in-line: seconds now save days or weeks later.
• Don’t be a zealot; avoid the word “never.”
• Do be an engineer and focus on efficiency, not dogma.
• Do use coding guidelines. If you don’t have any, use mine:
  https://elmichelsen.physics.ucsd.edu/ .
• Do use unwieldy language features sparingly: global variables, ‘goto’, embedded ‘return’, etc.,
  and include comments explaining why you chose them.
• Don’t complicate simple code, unless there is a clear payoff exceeding the bloat cost.
• If it ain’t broke, be sure it’s properly maintained so it doesn’t break in the future.


Considerations on Development Efficiency and Languages 


Programming languages are hammers. Not all programming jobs are the same; use the right hammer 
for the job. The following advice derives from my decades of engineering, programming, and science 
experience described earlier. 


I’ve seen the productivity benefits of ego-free programming, and I strongly discourage zealotry about 
hammers, or languages. It’s bad for science; it’s bad for your career. 


I emphasize that C++ is useful without a paradigm shift. 


Arguably, C++ allows a paradigm shift to object-orientation, but it certainly does not require it. You can 
reap lots of productivity gains with just a few simple classes, and no paradigm shift. In fact, scientific data 
analysis often gains nothing from shifting paradigms, which would just waste a lot of time. If you’re used to 
C, or any language, then ease into C++ (or object orientation) gradually. It’s OK. The alleged paradigm 
shift is more useful for complex data structures, which scientific data usually are not. 


Most people program Python without writing their own classes, again indicating that end-user-defined 
object-orientation is often not very useful. Especially so in scientific software. 


Sophistication Follows Function 


It is just as important for a computer program to communicate to humans as to machines. 


When coding anything, the sophistication of your code should scale up with the demands of its function. 
Simple functions need only simple code; don’t make complexity for no reason. Especially don’t make 
complexity just to conform with zealous dogma. There’s a huge difference between a 1000 line concept 
experiment, and a 100,000 line solar-system simulator meant to be used for decades. As a concrete example, 
consider the following two snippets: 


for (vector<datapoint>::iterator index = data.begin(); index != data.end();
        ++index)
    do_something(*index);


vs.


for (int i = 0; i < ndata; ++i)
    do_something(data[i]);


These two snippets do essentially the same thing. Which would you rather read? For most scientific 
programming, the latter is far preferable to the bloated former. In science, a simple list of data points is a
common and well-understood data structure. Don’t bloat it unless there is a clear payoff that exceeds the 
cost of the bloat. Libraries like the Standard Template Library are designed to be extremely general; such 
generality inevitably carries a burden of complexity. 


In fact, the awkwardness of the first snippet led to a new language feature in C++11, the range-based
‘for’ loop. This exemplifies a serious problem with the ever-evolving C++ language: the solution to its
clumsiness is (too often) ever-growing complexity. Rarely does the language evolve in a way to eliminate
the need for ever-more remedies. (Another example is “move constructors” and “move assignment”, which
only partially solve the target problem, add more complexity to class methods, make the language
self-contradictory, and could have been mostly avoided by simply allowing coders to explicitly specify the
destination object separately from the two operands. This would have made compilers simpler, too. More
later.)
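
For comparison, here is a small sketch of the C++11 range-based ‘for’ loop next to the C++98 iterator form. The element type datapoint with a single field value, and the function names, are assumptions made for illustration.

```cpp
#include <cassert>
#include <vector>

struct datapoint { double value; };     // hypothetical element type

// C++98 style: explicit iterator.
double sum_with_iterator(const std::vector<datapoint>& data)
{
    double total = 0.0;
    for (std::vector<datapoint>::const_iterator it = data.begin();
         it != data.end(); ++it)
        total += it->value;
    return total;
}

// C++11 range-based 'for': same traversal, without the iterator boilerplate.
double sum_with_range_for(const std::vector<datapoint>& data)
{
    double total = 0.0;
    for (const datapoint& p : data)
        total += p.value;
    return total;
}

// Both loops visit the same elements in the same order.
void check_same_sum(const std::vector<datapoint>& data)
{
    assert(sum_with_iterator(data) == sum_with_range_for(data));
}
```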


Engineering vs. Programming 


Software Engineering is much more than computer programming: it is the art and science of designing 
and implementing programs efficiently, over the long term, across multiple developers. Software engineering 
maximizes productivity and fun, and minimizes annoyance and roadblocks. 


Engineers first design, then implement, systems that are useful, fun, and efficient. 


Hackers just write code. Software engineering includes: 


• Documentation: lots of it in the code as comments. (“Commercial Fortran codes often contain about
  50% comments.” [web.stanford.edu/class/me200c/tutorial_77/03_basics.html retrieved 2000-06-03].)

• Documentation: design documents that give an overview and conceptual view that is infeasible to
  achieve in source code comments.

• Coding guidelines: for consistency among developers. Efficiency can only be achieved by
  cooperation among the developers, including a consistent coding style that allows others to quickly
  understand the code. E.g., https://elmichelsen.physics.ucsd.edu/Coding_Guidelines.pdf .

• Clean code: easy to read and follow.

• Maintainable code: it functions in a straightforward and comprehensible way, so that it can be
  changed easily and still work.


Notice that all of the above are subjective assessments. That’s the nature of all engineering: 


Engineering is lots of tradeoffs, with subjective approximations of the costs and benefits. 


Don’t get me wrong: sometimes I hack out code. The judgment comes in knowing when to hack and when 
to design. 


Fun quotes: 


“Whenever possible, ignore the coding standards currently in use by thousands of developers in your 
project’s target language and environment.” 
- Roedy Green, How To Write Unmaintainable Code, www.strauss.za.com/sla/code_std.html 


“Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as 
cleverly as possible, you are, by [implication], not smart enough to debug it.” - Brian W. Kernighan 


Coding guidelines make everyone’s life easier, even yours. - Eric L. Michelsen 


Object Oriented Programming 


This is a much used and abused term, with no definitive definition. The goal of Object Oriented 
Programming (OOP) is to allow reusable code that is clean and maintainable. The best definition I’ve seen 
of OOP is that it uses a language and approach with these properties: 


• User defined data types, called classes, which (1) allow a single object (data entity) to have
  multiple data items, and (2) provide user-defined methods (functions and operators) for
  manipulating objects of that class.

• Information hiding: a class can define a public interface which hides the implementation details
  from the (client) code which uses the class.

• Overloading: the same named function or operator can be invoked on multiple data types, including
  both built-in and user-defined types. The language chooses which of the same-named functions to
  invoke based on the data types of its arguments.

• Inheritance: new data types can be created by extending existing data types. The derived class
  inherits all the data and methods of the base class, but can add data, and override or overload any
  methods it chooses with its own, more specialized versions.

• Polymorphism: this is more than just overloading [ASU 1986 p344]. Polymorphism allows
  derived-class objects to be handled by (often older) code which only knows about the base class
  (i.e., which does not even know of the existence of the derived class). Even though the application
  code knows nothing of the derived class, the data object itself ensures that the proper specialized
  methods are called for it.


In C++, polymorphism is implemented with virtual functions. 
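
A minimal sketch of that mechanism follows. The classes Shape, Rectangle, and Triangle, and the function total_area, are hypothetical examples; the point is that total_area() is written knowing only the base class, yet calls each derived class’s own method.

```cpp
#include <cassert>
#include <vector>

// Base class: all that the "older" code knows about.
class Shape
{ public:
    virtual ~Shape() {}
    virtual double area() const = 0;    // virtual: resolved at run time
};

// Derived classes, possibly written long after total_area() below.
class Rectangle : public Shape
{ public:
    Rectangle(double w, double h) : width(w), height(h) {}
    double area() const override { return width * height; }
  private:
    double width, height;
};

class Triangle : public Shape
{ public:
    Triangle(double b, double h) : base(b), height(h) {}
    double area() const override { return 0.5 * base * height; }
  private:
    double base, height;
};

// Knows nothing of Rectangle or Triangle; each object selects its own area().
double total_area(const std::vector<const Shape*>& shapes)
{
    double total = 0.0;
    for (const Shape* s : shapes)
    {   assert(s != nullptr);
        total += s->area();     // calls the derived class's version
    }
    return total;
}
```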


OOP does not have to be a new “paradigm.” It is usually more effective to make OOP an 
improvement on the good software engineering practices you already use. 
Take it one step at a time. Don’t redesign the world for no reason. 


Further Musings on Coding and OOP 


I have a moderated view of coding. I think the difference between “procedural” and “object-oriented”
code is much smaller than many people think. All code is procedures for manipulating data. And breaking up
processing into hierarchical chunks isn’t OO, it’s just modularizing. Good code has always done that, long 
before OO was thought of. 


Object oriented code can still be terrible. 


There is a much bigger difference between commercial code and scientific code than there is between 
so-called “procedural” code and OO code. A lot of scientific code has no big future. All the work I did for 
my dissertation is dead, and will never live again. I’m completely comfortable having used C++ as a simple 
tool to get my dissertation job done, using only a few small classes, and using all the global variables that I 
did. That was the nature of the hammer I needed in order to graduate. Anything else would have been a 
complete waste of my time. 


That said, my code is heavily commented, and anyone who might inherit it will be much better off than 
most physicists. 


The key to productivity is making conscious, informed trade-offs, 
and documenting them in the comments. 


I’m not anti-OO, it’s just that I know that OO isn’t new. The modern tools have more benefits in 
commercial code than scientific code. Nonetheless, it would improve efficiency if more scientists used OO
when appropriate, but OO zealotry actually discourages that. The “all or nothing” screed means most coders 
must choose “nothing,” because they don’t have time for “all.” My credo is “Do what makes sense,” which 
usually means starting with a few small classes. As need demands, do more. Baby steps. 


Coding is an art. That’s why I moderate the absolutist views of zealots. 


The Best of Times, the Worst of Times: Run-time Efficiency 


We give here some ways to speed up general computations, using matrices as examples. The principles 
apply to almost any computation performed over a large amount of data. This is not about programming 
styles or paradigms; it’s about how to use language features and machine architecture to speed typical 
scientific code. 


For the vast majority of programs, run time is so short that it doesn’t matter how efficient it is;
clarity and simplicity are more important than speed.


In rare cases, time is a concern. For example, in my dissertation I did numerical experiments on both 
simulated and real data. I often ran an hour or two of computations per day, with me examining intermediate 
results along the way. I could not have afforded even a 3x slow down; I never would have graduated. Another 
example: supercomputer programmers are required to use computer time efficiently, and usually must include 
hardware metrics in the proposal for supercomputer time. Programs that fail efficiency standards are denied
time. Supercomputer programmers must understand the concepts in this section. 


For some simple examples, we show how to easily cut your execution times to 1/3 of original. We also 
show that execution times change in surprising ways, such as adding operations to reduce run time. This 
section assumes knowledge of computer programming with simple classes (the beginning of object oriented 
programming), and we use C++ as the example language, including some distinctions between C++98 and 
C++11. 


Run time optimization is a huge topic, so we can only touch on some basics. The main point here is: 


For large-data computations, memory management is the key to fast performance. 
Sometimes, easy changes can yield huge benefits. 


We proceed along these lines: 


• A simple C++ class for matrix addition. We give run times for this implementation (the worst of
  times).

• A simple improvement greatly improves execution times (the best of times).

• We try another expected improvement, but things are not as expected.

• We describe the general operation of “memory cache” (pronounced “cash”).

• Moving on to matrix multiplication, we find that our previous tricks don’t work.

• However, due to the cache, adding more operations greatly improves the execution times.


We give actual code listings for the examples, but you can ignore them, and still understand the concepts. 


Example Using Matrix Addition 


One concept in speeding up large-object methods is to avoid C++’s hidden copy operations. However: 


Computer memory access is tricky, so things aren’t always what you’d expect. Nonetheless, 
we can be efficient, even without details of the computer hardware. 


The tricks are due to computer hardware called RAM “cache,” whose general principles we describe later, 
but whose details are beyond our scope. 


First, here is a simple C++ class for matrix creation, destruction, and addition. (For simplicity, our 
sample code has no error checking; real code, of course, does. In this case, we literally don’t want reality to 
interfere with science.) The class data for a matrix are the number of rows, the number of columns, and a 
pointer to the matrix elements (data block). 


typedef double T;                       // matrix elements are double precision
class ILmatrix                          // 2D matrix
{ public:
    int nr, nc;                         // # rows & columns
    T *db;                              // pointer to data
    ILmatrix(int r, int c);             // create matrix of given size
    ILmatrix(const ILmatrix &b);        // copy constructor
    ~ILmatrix();                        // destructor

    // The following allows 2-index subscripting as: a[i][j]
    T * operator [](int r) const {return db + r*nc;}    // subscripting
    ILmatrix & operator =(const ILmatrix& b);           // assignment
    ILmatrix operator +(const ILmatrix& b) const;       // matrix add
};

The matrix elements are indexed starting from 0, i.e. the top-left corner of matrix ‘a’ is referenced as
‘a[0][0]’. Following the data are the minimum set of methods (procedures) for matrix addition. Internally, 
the pointer ‘db’ points to the matrix elements (data block). The subscripting operator finds a linear array 
element as (row)(#columns) + column. Here is the code to create, copy, and destroy matrices: 


// create matrix of given size (constructor)
ILmatrix::ILmatrix(int r, int c) : nr(r), nc(c)     // set nr & nc here
{
    db = new T[nr*nc];                  // allocate data block
}   // ILmatrix(r, c)

// copy a matrix (copy constructor)
ILmatrix::ILmatrix(const ILmatrix & b)
{   int r, c;
    nr = b.nr, nc = b.nc;               // matrix dimensions
    if(b.db)
    {   db = new T[nr*nc];              // allocate data block
        for(r = 0; r < nr; ++r)         // copy the data
            for(c = 0; c < nc; ++c)
                (*this)[r][c] = b[r][c];
    }
}   // copy constructor

// destructor
ILmatrix::~ILmatrix()
{   if(db) {delete[] db;}               // free existing data
    nr = nc = 0, db = 0;                // mark it empty
}

// assignment operator
ILmatrix & ILmatrix::operator =(const ILmatrix& b)
{   int r, c;

    for(r = 0; r < nr; ++r)             // copy the data
        for(c = 0; c < nc; ++c)
            (*this)[r][c] = b[r][c];
    return *this;
}   // operator =()


The good stuff: With the tedious preliminaries done, we now implement the simplest matrix addition 
method. It adds two matrices element by element, and returns the result as a new matrix: 

// matrix addition to temporary
ILmatrix ILmatrix::operator +(const ILmatrix& b) const
{
    int r, c;
    ILmatrix result(nr, nc);            // create temporary for result

    for(r = 0; r < nr; ++r)
        for(c = 0; c < nc; ++c)
            result[r][c] = (*this)[r][c] + b[r][c];
    return result;                      // invokes copy constructor!
}   // operator +()


How long does this simple code take? To test it, we standardize on 300 × 300 and 400 × 400 matrix
sizes, each on two different computers: computer-1 is a circa 2000 Compaq Workstation W6000 with a 1.7
GHz Xeon. Computer-2 is a circa 2003 Gateway Solo 200 ARC laptop with a 2.4 GHz CPU. We time 100
matrix additions, e.g.:

int n = 300;                            // matrix dimension
ILmatrix a(n,n), b(n,n), d(n,n);

// Addition test
d = a + b;                              // prime memory caches
cpustamp("start matrix addition\n");
for(int i = 0; i < 100; ++i)
    d = a + b;
cpustamp("end matrix addition\n");


With modern operating systems, you may have to run your code several times 
before the execution times stabilize. 


This may be due to internal operations of allocating memory, and flushing data to disk. 


We find that, on computer-1, it takes ≈ 1.36 ± 0.10 s to execute 100 simple matrix additions (see table at
end of this section). Wow, that seems like a long time. Each addition is 90,000 floating point adds; 100
additions is 9 million operations. A 2.4 GHz CPU completing one addition per clock cycle would execute
2.4 additions per ns, which totals only about 4 ms. Where’s all the time going? C++ has a major flaw.
Though it was pretty easy to create our matrix class:


C++ copies your data twice in a simple class operation on two values. 


So besides our actual matrix addition, C++ copies the result twice before it reaches the matrix ‘d’. The first 
copy happens at the ‘return result’ statement in our matrix addition function. Since the variable ‘result’ will 
be destroyed (go out of scope) when the function returns, C++ must copy it to a temporary variable in the 
main program. Notice that the C++ language has no way to tell the addition function that the result is headed 
for the matrix ‘d’. So the addition function has no choice but to copy it into a temporary matrix, created by 
the compiler and hidden from programmers. The second copy is when the temporary matrix is assigned to 
the matrix ‘d’. Each copy operation copies 90,000 8-byte double-precision numbers, ~720k bytes. That’s a 
lot of copying. 
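
The figures quoted above follow from simple arithmetic, written out here for 100 additions of 300 × 300 matrices. The rate of one addition per clock cycle is an idealizing assumption, not a measured value.

```latex
\begin{align*}
N_{\text{adds}} &= 100 \times 300^2 = 9 \times 10^{6} \\
t_{\text{ideal}} &\approx \frac{9 \times 10^{6}}{2.4\ \text{ns}^{-1}}
                 = 3.75 \times 10^{6}\ \text{ns} \approx 4\ \text{ms} \\
\text{bytes per copy} &= 300^2 \times 8 = 720{,}000 \\
\text{bytes copied in total} &= 2 \times 100 \times 720{,}000 = 1.44 \times 10^{8}
\end{align*}
```

So the hidden copies move about 144 MB in total, twice the volume of the results actually wanted.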


Optimization 1: What can we do about this? The simplest improvement is to make our copies more
efficient. Instead of writing our own loops to copy data, we can call the library function memcpy( ), which
is specifically optimized for copying blocks of data. Our copy constructor is now both simpler and faster:

ILmatrix::ILmatrix(const ILmatrix & b)
{
    nr = b.nr, nc = b.nc;                       // matrix dimensions
    if(b.db)
    {   db = new T[nr*nc];                      // allocate data block
        memcpy(db, b.db, sizeof(T)*nr*nc);      // copy the data
    }
}   // copy constructor


Similarly for the assignment operator. This new code takes 0.98 ± 0.10 s, 28 % better than the original
code.


Use built-in and professional libraries whenever possible. 
They’re already optimized and debugged. 


Not bad for such a simple change, but still bad: we still have two needless copies going on. 


Optimization 2: For the next improvement, we note that C++ can pass two class operands to a class
operator function, but not three. Therefore, if we do one copy ourselves, we can then perform the addition
“in place,” and avoid the second copy. For example:

// Faster code to implement d = a + b:
d = a;              // the one and only copy operation
d += b;             // ‘+=’ adds ‘b’ to the current value of ‘d’


We can simplify this main code to a single line as:

(d = a) += b;


To implement this code, we added a “+=” operator function to our class, and it must return by reference, not
by copy:


// matrix addition in-place
ILmatrix & ILmatrix::operator +=(const ILmatrix & b)
{   int r, c;

    for(r = 0; r < nr; ++r)
        for(c = 0; c < nc; ++c)
            (*this)[r][c] += b[r][c];

    return *this;           // returns by reference, NO copy!
}


This code runs in 0.45 ± 0.02 s, or 1/3 the original time! The price in C++98, though, is somewhat uglier user
code.


Move semantics: C++11 defines a new feature called “move semantics,” that is designed to reduce the 
burden of the unnecessary copies, and therefore can avoid the code ugliness above. Note that it still retains 
the unnecessary copies, but it sometimes allows them to be less burdensome. Essentially, “move semantics” 
are only helpful if the class includes dynamically allocated memory. If you simply have large objects with 
fixed memory, “move semantics” do nothing, and you’re stuck with the ugly method above. 


“Move semantics” requires us to define two more “class special” methods: a “move constructor,” and a 
“move assignment” method. The key element of move semantics is that the two “move” methods are called 
only when the right hand operand is about to be destructed, so there is no need to preserve its data. You can 
simply “hand-off” the data block from the right-operand to the destination object. The right-operand can be 
left empty, because it’s about to be destructed anyway. We leave the details to C++11 tutorials. (We propose 
adding a C++ keyword “that”, that would very simply avoid all the copies all the time, without requiring any 
“move semantics,” and without any new “class special” functions.) 
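
As a rough illustration of the two “move” methods, here is a minimal sketch. SketchMatrix is a hypothetical stand-in for a class like ILmatrix, not the class timed in this section, and the static counter copies exists only to show that no deep copy occurs.

```cpp
#include <cassert>
#include <cstring>
#include <utility>

// SketchMatrix is a hypothetical stand-in for a matrix class like ILmatrix.
class SketchMatrix
{ public:
    int nr, nc;             // # rows & columns
    double *db;             // pointer to data block
    static int copies;      // counts deep copies, for demonstration only

    SketchMatrix(int r, int c) : nr(r), nc(c), db(new double[r*c]()) {}
    ~SketchMatrix() { delete[] db; }

    // copy constructor: deep copy of the data block
    SketchMatrix(const SketchMatrix& b)
        : nr(b.nr), nc(b.nc), db(new double[b.nr*b.nc])
    {   if (b.db) std::memcpy(db, b.db, sizeof(double)*nr*nc);
        ++copies;
    }

    // copy assignment: deep copy (assumes equal dimensions, as in the text)
    SketchMatrix& operator =(const SketchMatrix& b)
    {   assert(nr == b.nr && nc == b.nc);
        if (this != &b) std::memcpy(db, b.db, sizeof(double)*nr*nc);
        ++copies;
        return *this;
    }

    // move constructor: take over b's data block, leave b empty
    SketchMatrix(SketchMatrix&& b) noexcept : nr(b.nr), nc(b.nc), db(b.db)
    {   b.nr = b.nc = 0;  b.db = nullptr;  }

    // move assignment: free our block, then take over b's
    SketchMatrix& operator =(SketchMatrix&& b) noexcept
    {   if (this != &b)
        {   delete[] db;
            nr = b.nr;  nc = b.nc;  db = b.db;
            b.nr = b.nc = 0;  b.db = nullptr;
        }
        return *this;
    }

    // matrix add, returning a temporary
    SketchMatrix operator +(const SketchMatrix& b) const
    {   SketchMatrix result(nr, nc);
        for (int i = 0; i < nr*nc; ++i)
            result.db[i] = db[i] + b.db[i];
        return result;      // moved or elided, never deep-copied
    }
};
int SketchMatrix::copies = 0;
```

With these methods, ‘d = a + b’ hands the temporary’s data block to ‘d’ instead of copying it, so the counter stays at zero.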


Attempted optimization 3: Perhaps we can do even better. Instead of using operator functions, which
are limited to only two matrix arguments, we can write our own addition function, with any arguments we
want. We can then eliminate all the needless copies. The main code is now:

mat_add(d, a, b);       // add a + b, putting result in ‘d’


Requiring the new function “mat_add()”:


// matrix addition to new matrix: d = a + b
ILmatrix & mat_add(ILmatrix & d, const ILmatrix & a, const ILmatrix & b)
{
    int r, c;

    for(r = 0; r < d.nr; ++r)
        for(c = 0; c < d.nc; ++c)
            d[r][c] = a[r][c] + b[r][c];

    return d;           // returned by reference, NO copy constructor
}   // mat_add()


This runs in 0.49 ± 0.02 s, slightly worse than the one-copy version. It’s also even uglier for users than the
previous version. How can this be?


Memory access, including data copying, is dominated by the effects of 
a complex piece of hardware called “memory cache.” 


There are hundreds of different variations of cache designs, and even if you know the exact design, you can 
rarely predict its exact effect on real code. We will describe cache shortly, but there is no feasible way to 
know exactly why the zero-copy code is slower than one-copy. This result also held true for the 400 × 400
matrix on computer-1, and the 300 × 300 matrix on computer-2, but not the 400 × 400 matrix on computer-2.
In the end, all we can do is try a few likely optimizations, and keep the ones that tend to perform well.
More on this later. 


Beware: Leaving out a single character from your code can produce code that works, but runs over 2 times
slower than it should. For example, in the function definition of mat_add(), if we leave out the
“&” before argument ‘a’:

ILmatrix & mat_add(ILmatrix & d, const ILmatrix a, const ILmatrix & b)


then the compiler passes ‘a’ to the function by copying it! This completely defeats our goal of zero copy.
[Guess how I found this out.]
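
The hidden copy is easy to make visible. In this minimal sketch, CountingBlock is a hypothetical class whose copy constructor does nothing but count its own invocations; the two functions differ only by the “&”.

```cpp
#include <cassert>

// CountingBlock is a hypothetical class that only counts its own copies.
struct CountingBlock
{   static int copies;
    CountingBlock() {}
    CountingBlock(const CountingBlock&) { ++copies; }
};
int CountingBlock::copies = 0;

// Argument passed by reference: no copy is made.
void use_by_reference(const CountingBlock& a) { (void)a; }

// Same function with the '&' left out: 'a' is copied on every call.
void use_by_value(const CountingBlock a) { (void)a; }

void demo_missing_ampersand()
{
    CountingBlock m;
    const int before = CountingBlock::copies;
    use_by_reference(m);
    assert(CountingBlock::copies == before);        // no copy made
    use_by_value(m);
    assert(CountingBlock::copies == before + 1);    // one hidden copy
}
```

For a class holding a large data block, each of those hidden copies costs a full allocation and block copy.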


Also notice that the ‘memcpy( )’ optimization doesn’t apply to this last method, since it has no copies at 
all. 


Below is a summary of our matrix addition optimizations. The best performance was usually a single 
copy, with in-place addition. It is medium ugly. While there was a small performance discrepancy on 
computer-2 at 400 x 400, the zero-copy code is not worth the required additional ugliness. 


1360 = 100 % | 5900 = 100 % | 1130 = 100 % | 2180 = 100 %


Figure 15.1 Run times for matrix addition with various algorithms. Best performance is 
highlighted. 


Memory Consumption vs. Run Time: Cache as Cache Can 


In the old days, the claim was clear (but not the reality): the less memory you use, the slower your 
algorithm, and speeding your algorithm requires more memory. 


The fallacy there is that most code is not well written. When you clean up code, you often create 
implementations that are more efficient in both memory and time. I have personally done this many times, 
even when revising my own code. 


However, given reasonably efficient implementations, then with no memory cache (described below), 
one can usually speed a computation by using an algorithm that requires more memory. Conversely, an 
algorithm that uses less memory is usually slower. 


But finally, since all modern computers are heavily affected by cache, we must take it into account. If 
you “exceed the cache”, i.e. your algorithm repeatedly works through more memory than the cache can hold, 
you will suffer a dramatic slow-down in speed. In such a case, an algorithm that uses less memory may be 
faster: the algorithmic performance loss may be offset by the cache performance increase, possibly many 
times over. 


Cache Value 


Before about 1990, computations were slower than memory accesses. Therefore, we optimized by 
increasing memory use, and decreasing computations. Today, things are exactly reversed: a 2.5 GHz 
processor has a compute cycle time of 0.4 ns, but a main memory access takes around 40 ns. 


To help reduce the speed degradation of slow main memory, computers use memory caches: small memories 
that are very fast (Figure 15.2). A cache system is usually 3 levels of progressively bigger and slower caches. 
A typical main memory is 16 GB, while a typical cache (system) is 1-10 MB, or more than 1000x smaller. 
The CPU can often access level 1 (L1) cache memory as fast as it can compute, so L1 cache can be ~100x 
faster than main memory. There is huge variety in computer memory designs, so the description below is 
general. Behavior varies from machine to machine, sometimes greatly. Our data below demonstrate this.


The cache is invisible to program function, but is critical to program speed. The programmer usually
does not have access to details about the cache, but she can use general cache knowledge to greatly reduce
run time. (As noted earlier, supercomputer programmers are usually required to use special diagnostics to
ensure programs use cache efficiently.)


[Figure: (a) sequential RAM addresses 0 to N-1; (b) RAM drawn as a 2D array holding a matrix; (c) small, faster RAM caches (L1, L2, L3) in front of main memory, which is big and slow.]


Figure 15.2 (a) Computer memory (RAM) is a linear array of bytes. (b) For convenience, we draw 
RAM as a 2D array, of arbitrary width. We show sample matrix storage. (c) A fast memory cache 
system keeps a copy of recently used memory locations, so they can be quickly used again. 


The cache does two things (Figure 15.2): 


1. Cache remembers recently used memory values, so that if the CPU requests any of them again, the 
cache provides the value quickly, and the slow main memory access does not happen. 


2. Cache “looks ahead” to fetch memory values immediately following the one just used, before the 
CPU might request it. If the CPU in fact later requests the next sequential memory location, the 
cache provides the value quickly, having already fetched it from slow main memory. 


The cache is small, and eventually fills up. Then, when the CPU requests new data, the cache must discard 
old data, and replace it with the new. Therefore, if the program jumps around memory a lot, the benefits of 
the cache are reduced. If a program works repeatedly over a small region of memory (say, a few hundred
kilobytes), the benefits of cache increase. Typically, cache can follow at least four separate regions of memory
concurrently. This means you can interleave accesses to four different regions of memory, and still retain 
the benefits of cache. Therefore, we have three simple rules for efficient memory use: 


For efficient memory use: (1) access memory sequentially, or in small steps,
(2) reuse values as much as possible in the shortest time, and
(3) access few memory regions concurrently, preferably no more than four.


Cache Benefits 


We can now understand some of our timing data given above. We see that the one-copy algorithm 
unexpectedly takes less time than the zero-copy algorithm. The one-copy algorithm accesses only two 
memory regions at a time: first matrix ‘a’ and ‘d’ for the copy, then matrix ‘b’ and ‘d’ for the add. The zero- 
copy algorithm accesses three regions at a time: ‘a’, ‘b’, and ‘d’. This is probably reducing cache efficiency. 
Recall that the CPU is also fetching instructions (the program) concurrently with the data, which can be a 
fourth region. Exact program layout in memory is virtually impossible to know. Also, not all caches support 
4-region concurrent access. The newer machine, computer-2, probably has a better cache, and the one- and 
zero-copy algorithms perform very similarly. 


Order matters: Here’s a new question for matrix addition: the code given earlier loops over rows in
the outer loop, and over columns in the inner loop. What if we reversed them, and looped over columns on
the outside, and rows on the inside? The result is 65% longer run time, on both machines. Here’s why: the
matrices are stored by rows, i.e. each row is consecutive memory locations (Figure 15.2b). Looping over
columns on the inside accesses memory sequentially, taking advantage of cache look-ahead. When reversed,
the program jumps from row to row on the inside, giving up any benefit from look-ahead. The cost is quite
substantial. Sequential memory access speeds code on almost every machine. The code behavior of making
mostly nearby memory references is called reference locality (aka “data locality”).
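
The two loop orders can be sketched as follows. This is a minimal illustration, not the ILmatrix code: it assumes a square matrix stored by rows in a std::vector, and the function names are made up for the example. Both functions compute the same sum; only the memory access pattern differs.

```cpp
#include <cstddef>
#include <vector>

// Builds an n x n row-major test matrix holding 0, 1, 2, ... (illustrative helper).
std::vector<double> make_test_matrix(std::size_t n)
{
    std::vector<double> m(n * n);
    for (std::size_t i = 0; i < m.size(); ++i)  m[i] = static_cast<double>(i);
    return m;
}

// Rows outside, columns inside: consecutive accesses are adjacent in memory.
double sum_row_order(const std::vector<double> & m, std::size_t n)
{
    double s = 0.;
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            s += m[r * n + c];
    return s;
}

// Columns outside, rows inside: consecutive accesses are n elements apart,
// so cache look-ahead helps little once the matrix exceeds the cache.
double sum_column_order(const std::vector<double> & m, std::size_t n)
{
    double s = 0.;
    for (std::size_t c = 0; c < n; ++c)
        for (std::size_t r = 0; r < n; ++r)
            s += m[r * n + c];
    return s;
}
```

To see the effect, time both functions on a matrix much larger than your cache. The size of the penalty depends on the machine.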
Caution Fortran stores arrays in the opposite order from C and C++. In Fortran, the first index is cycled
most rapidly, so you should code with the outer loop on the second index, and the inner loop on
the first index. E.g.,


DO C = 1, N
    DO R = 1, N
        A(R, C) = blah blah ...
    ENDDO
ENDDO


Scaling behavior: Matrix addition is an O(N²) operation, so increasing from 300 x 300 to 400 x 400
increases the computations by a factor of (400/300)² ≈ 1.8. On the older computer-1, the runtime penalty is much larger,
between 4.5x and 9x slower. On the newer computer-2, the difference is much closer to the expected:
between 1.8x and 2.2x slower. This discrepancy is likely due to cache size. A 300 x 300 double precision
matrix takes 720k bytes (300 x 300 x 8), or under a MB. A 400 x 400 matrix takes 1280k bytes, just over one MB. It could
be that on computer-1, with the smaller matrix, a whole matrix or two fits in cache, but with the large matrix,
cache is overflowed, and more (slow) main memory accesses are needed. The newer computer probably has
bigger caches, and may fit both sized matrices fully in cache.


Cache Withdrawal: Making the Most of Reference Locality 


We now show that the above tricks don’t work well for, say, large-matrix multiplication, but a different 
trick cuts multiplication run time dramatically. This example uses matrix multiplication to illustrate the 
importance of reference locality, but you can apply this principle to any code that accesses lots of memory. 
(In general, you should not write your own matrix multiplication code; there are lots of free libraries to do it, 
that are optimized with the following method, as well as others. On the other hand, some “free” code isn’t 
optimized at all, and will perform terribly.) 


To start the example, we use a simple matrix multiply in the main code:
d = a * b;


The straightforward matrix multiply operator is this:
// Simple matrix multiply to temporary
ILmatrix ILmatrix::operator *(const ILmatrix & b) const
{
    int         r, c, k;
    ILmatrix    result(nr, b.nc);   // temporary for result
    T           sum;
    for(r = 0; r < nr; ++r)
    {   for(c = 0; c < b.nc; ++c)
        {   sum = 0.;
            for(k = 0; k < nc; ++k)  sum += (*this)[r][k] * b[k][c];
            result[r][c] = sum;
        }
    }
    return result;      // invokes copy constructor!
}   // operator *()


While matrix addition is an O(N²) operation, matrix multiplication is O(N³). Multiplying two 300 x 300
matrices is about 54,000,000 floating point operations, which is much slower than addition. Timing the
simple multiply routine, similarly to timing matrix addition but with only 5 multiplies, we find it takes
7.8 ± 0.1 s on computer-1.


First we try the tricks we already know to improve and avoid data copies: we started already with 
memcpy(). For matrix multiply, there is no one-copy option: you can’t multiply “in place” because you 
would overwrite the input with results when you still needed the input. In other words, there is no ‘*=’ 
operator for matrix multiply. So we compare the two-copy and zero-copy algorithms as with addition, but 
this time the 4 trials show no measurable difference. Matrix multiply is so slow that the copy times are 
insignificant. Therefore, we choose the two-copy algorithm, which is easiest on the user. We certainly drop 
the ugly 3-argument mat_mult( ) function, which gives no benefit.


Now we’ll improve our matrix multiply greatly, by adding more work to be done. The extra work will
result in more efficient memory use, that pays off handsomely in reduced runtime. Notice that in matrix
multiplication, for each element of the result, we access a row of the first matrix a, and a column of the
second matrix b. But we learned from matrix addition that accessing a column is much slower than accessing
a row. And in matrix multiplication, we have to access the same column N times. Extra bad. If only we
could access both matrices by rows!


Well, we can. We first make a temporary copy of matrix b, and transpose it. Now the columns of b 
become the rows of b’. We perform the multiply as rows of a with rows of b’. We’ve already seen that copy 
time is insignificant for multiplication, so the cost of one copy and one transpose (similar to a copy) is 
negligible. But the benefit of cache look-ahead is large. The transpose method reduces runtime by 30% to 
50%. 


Further thought reveals that we only need one column of b at a time. We can use it N times, and discard
it. Then move on to the next column of b. This reduces memory usage, because we only need extra storage 
for one column of b, not for the whole transpose of b. It costs us nothing in operations, and reduces memory. 
That will probably help our cache performance. In fact, on computer-1, it cuts runtime by almost another 
factor of two, to about one third of the original runtime. It has little effect on computer-2. (It does require 
us to loop over columns of b on the outer loop, and rows of a on the inner loop, but that’s no burden.) 
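
A minimal sketch of the column-copy idea follows. It is not the ILmatrix implementation: it assumes square matrices stored by rows in a std::vector (the typedef Mat and all function names are hypothetical), and it shows the simple multiply alongside for comparison.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<double> Mat;    // hypothetical row-major n x n storage

// Illustrative helper: fills an n x n matrix with small integers.
Mat make_matrix(std::size_t n, std::size_t offset)
{
    Mat m(n * n);
    for (std::size_t i = 0; i < m.size(); ++i)
        m[i] = static_cast<double>((i + offset) % 7);
    return m;
}

// Simple multiply: the inner loop walks down a column of b (stride n).
Mat multiply_simple(const Mat & a, const Mat & b, std::size_t n)
{
    Mat d(n * n);
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
        {
            double sum = 0.;
            for (std::size_t k = 0; k < n; ++k)  sum += a[r * n + k] * b[k * n + c];
            d[r * n + c] = sum;
        }
    return d;
}

// Column-copy multiply: copy one column of b into contiguous storage,
// use it against every row of a, then move on to the next column.
Mat multiply_column_copy(const Mat & a, const Mat & b, std::size_t n)
{
    Mat d(n * n);
    std::vector<double> col(n);                     // extra storage: one column only
    for (std::size_t c = 0; c < n; ++c)             // columns of b on the outer loop
    {
        for (std::size_t k = 0; k < n; ++k)  col[k] = b[k * n + c];
        for (std::size_t r = 0; r < n; ++r)         // rows of a on the inner loop
        {
            double sum = 0.;
            for (std::size_t k = 0; k < n; ++k)  sum += a[r * n + k] * col[k];
            d[r * n + c] = sum;
        }
    }
    return d;
}
```

Both functions perform the same multiply-adds in the same order, so they return identical results. In the column-copy version, the innermost loop reads both operands sequentially.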


Note that optimizations that at first were insignificant, say reducing runtime by 10%, may become 
significant after the runtime is cut by a factor of 3. That original 10% is now 30%, and may be worth doing. 


                        Computer-1 times (ms)               Computer-2 times (ms, ± ~100 ms)
d = a * b               7760 = 100%     18,260 = 100%       5348 = 100%     16,300 = 100%

d = a * b,              4580 =  59%     12,700 =  70%       2900 =  54%       7800 =  48%
transpose ‘b’

d = a * b,              2710 =  35%       7875 =  43%       3100 =  58%       8000 =  49%
copy ‘b’ column


Figure 15.3 Run times for matrix multiplication with various algorithms. Best performing
algorithms are highlighted.


Data Structures for Efficient Cache Use 


Here is a simple example of organizing data structures to improve cache efficiency. Consider a set of 
data points, each comprising 4 numbers, A, B, C, and D. In the old days, programmers often used “parallel 
arrays” to store the data: 

double a[10000], b[10000], c[10000], d[10000]; 


Suppose the processing requires looping through all the points, processing A, B, C, and D for each point.
The parallel arrays above spread a single data point far across memory, the worst thing you can do for cache
efficiency. Much better is:

struct dpoint
{   double a;
    double b;
    double c;
    double d;
};
dpoint data[10000];                                                     (15.1)


Though this takes many more lines of code, it is conceptually clearer, and keeps data from each point 
physically adjacent in memory. When looping through the data points, cache prefetch works well to reduce 
memory delays. However, it does clutter your code because a simple a[i] becomes the clumsy 
data[i].a.


Algorithms for Efficient Cache Use


Many algorithms require multiple passes over a list of data (e.g., mergesort for sorting a list of data in
ascending order). Rather than simply iterate over the whole list multiple times, it is faster to stay within a
cache line (probably 16 - 128 bytes) for multiple iterations, and then move on. The simple, slow way might
be:

for(int pass = 0; pass < NPASSES; ++pass)
    for(int dx = 0; dx < NDATA; ++dx)
        (... process data[dx]);


A faster way might be:

// LINESIZE = number of data elements per cache line; NDATA is assumed to be a multiple of LINESIZE
for(int start = 0; start < NDATA; start += LINESIZE)
    for(int pass = 0; pass < NPASSES; ++pass)
        for(int dx = start; dx < start + LINESIZE; ++dx)
            (... process data[dx]);


This might make you cringe, because you’ve replaced two nested loops with three. However, the cache 
locality of the 3-loop code is much better, and may well overcome the small increase in loop initialization. 
It helps if you know the cache line size of the target machine, and beware that running code on a machine 
with a smaller line size than you coded for will probably be worse than the simple 2-loop code. 
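
Here is a runnable sketch of the two loop structures. It makes a strong simplifying assumption: each pass updates an element using only that element’s own value, so the passes can be reordered freely. The update in process( ) and all function names are made up for illustration. An algorithm like mergesort has dependencies between elements and needs more care.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical per-element update, standing in for "process data[dx]".
inline double process(double x) { return 0.5 * x + 1.0; }

// Illustrative helper: data[i] = i.
std::vector<double> make_data(std::size_t n)
{
    std::vector<double> data(n);
    for (std::size_t i = 0; i < n; ++i)  data[i] = static_cast<double>(i);
    return data;
}

// Simple way: every pass sweeps the whole list.
std::vector<double> passes_simple(std::vector<double> data, int npasses)
{
    for (int pass = 0; pass < npasses; ++pass)
        for (std::size_t dx = 0; dx < data.size(); ++dx)
            data[dx] = process(data[dx]);
    return data;
}

// Blocked way: finish all passes on one small block before moving on.
// 'block' is the number of elements per block and must be at least 1.
std::vector<double> passes_blocked(std::vector<double> data, int npasses, std::size_t block)
{
    for (std::size_t start = 0; start < data.size(); start += block)
    {
        // The last block may be short if the size is not a multiple of 'block'.
        const std::size_t stop = std::min(start + block, data.size());
        for (int pass = 0; pass < npasses; ++pass)
            for (std::size_t dx = start; dx < stop; ++dx)
                data[dx] = process(data[dx]);
    }
    return data;
}
```

Each element receives the same sequence of updates in both versions, so the results are identical. Only the order of memory accesses changes.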


Cache Optimization Summary 


In the end, exact performance is nearly impossible to predict. However, general knowledge of cache, 
and following the three rules for efficient cache use (given above), will greatly improve your runtimes. 


Conflicts in memory among pieces of data, and within layout of instructions, cannot be precisely
controlled. Sometimes even tiny changes in code will cross a threshold of cache,
and cause huge changes in performance.


Remember the 90/10 rule: 90% of your CPU time is spent in only 10% of your code (often it’s closer to
95/5). Focus on the small fraction of code that matters; don’t optimize code that doesn’t run enough to matter.


Virtual Memory and Page Locality 


Besides RAM cache, there is another kind of reference locality that affects computing speed: Virtual 
Memory pages. All modern computers divide main memory (RAM) into pages: contiguous blocks of 
memory, of typically 4 KB (Figure 15.4). Your application’s memory is also divided up into such pages, 
though you don’t (usually) need to know it. However, as with RAM cache, knowledge is power. To some 
extent, you can control your page layout to improve speed.


[Figure: virtual memory diagram; the only surviving label is the translation lookaside buffer (TLB).]


Figure 15.4 Virtual memory is the model of memory seen by an application. The actual storage 
for virtual memory is both physical RAM, and disk. The gray boxes in RAM are not part of the 
application’s VM, but may be used by the OS or other applications. Cache is omitted for simplicity. 


The basic idea behind improved page performance is the same as with RAM cache: keeping your data 
together on fewer pages allows faster execution. Specifically, there are two reasons page locality improves 
speed: improved Translation Lookaside Buffer hit rate, and improved page hit rate. Before describing these 
two aspects, we must have an overview of Virtual Memory (VM).


Virtual Memory: Virtual Memory exists to allow operating systems to efficiently allocate memory to
applications, and to give the applications access to more memory than the system has physical RAM. An
application on a modern system “sees” a model of memory where a contiguous region of memory occupies
contiguous addresses. For example, an application may see 4,096,000 bytes (4 MB) of memory, with virtual
addresses from 0 to 4,095,999. This is virtual memory: the memory model seen by the application. The
actual storage for an application’s VM is often scattered all over physical RAM. Every time an application
requests access (read or write) to a virtual address, the CPU must look up that address in a page table that tells
the CPU where the data is actually stored. If the data is in physical RAM, access is very fast. The CPU uses
the page table to “translate” the virtual address to its physical RAM address, and fulfills the request.


To make management feasible, VM is divided into pages, or contiguous blocks of memory, typically
about 4 KB long (as of 2018). The page table then has one entry for each page of virtual memory. This 
keeps the page table a feasible size. The page table itself (usually) resides in physical RAM. 


All of an application’s (virtual) memory may not fit in RAM simultaneously, so the OS may keep some 
of it out on disk. The OS maintains the page table to either point to a virtual page’s physical RAM address, 
or to note that the page is currently on disk. CPU access to the page table is a main RAM access, and can be 
slow, as long as 30 - 100 cycles (~10- 30 ns). If this happened on every memory access, it would be crippling. 


The saving grace is a Translation Lookaside Buffer (TLB), which is just a cache of page table 
translations. When the CPU needs to translate a virtual address to physical, it checks the TLB. If found, the 
CPU gets the physical address instantly, so the CPU runs at full speed. If the virtual page is not in the TLB, 
the CPU must consult the page table in RAM, and suffer the speed penalty. A TLB typically has 32 - 128 
entries (as of 2018). A virtual address that is found in the TLB is called a TLB “hit;” a virtual address not 
found in the TLB is a TLB “miss” (similar to “cache hit”, and “cache miss”.) Typical TLB miss rates are 
0.1% to 20% or more (for badly-written or uncooperative code). 


Of course, after a TLB miss, the CPU puts the translation from the page table into the TLB, so the TLB 
is ready for a future access. To update the TLB, the CPU must overwrite an old entry with the new one. 


Fault lines: There’s a chance that when the CPU consults the page table, it will find that the data is not 
in physical RAM, but out on disk. This is called a page-fault. It requires a ~30 ms disk access, which is a 
million times slower than a TLB miss. This is perfectly normal, but slow. The ability to put virtual pages on 
disk allows an application to operate with more virtual memory than the system has physical RAM, because 
all the virtual memory is not simultaneously held in physical RAM. VM is fetched as needed. 


On a page fault, the CPU takes an interrupt to notify the OS of the fault. The OS identifies a virtual page 
in physical RAM to “flush;” the OS writes (flushes) the data from the current page to disk, and reads the new 
page from disk into RAM. The OS updates the page table for both virtual pages, and the application can 
continue. 


Be careful to distinguish a “page-fault” from a “segmentation fault” or other access violation. A page- 
fault is normal; a “seg-fault” is a fatal application bug where it tries to access non-existent virtual memory, 
usually forcing the OS to terminate the application. 


Your mission, should you decide to accept it: The first way to minimize page misses is to organize 
data into related structures, as in code (15.1). Fortunately, this is the same advice as for cache locality, so if 
you care about speed, you’ve already done this. 


Similarly, if you must access many different pages, it can be a huge benefit to access within just a few 
pages as much as possible, before moving on to the next set. This maximizes TLB and page-table hit rates 
(thus minimizing page faults). For example, if using a merge sort algorithm, sort all the values within each 
page first, and only then move on to merging from the pages. 


To code specifically for page locality, you must know some details about your system,
especially the page size. This is readily available in documentation, and in some system calls, e.g., Unix
sysconf(_SC_PAGESIZE). As of 2018, the page size on almost all machines is 4 KB.
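
A minimal sketch of the query, assuming a POSIX system (Linux, macOS, etc.) where unistd.h provides sysconf( ). The wrapper function names are made up for illustration.

```cpp
#include <unistd.h>     // POSIX: sysconf(), _SC_PAGESIZE

// Returns the virtual memory page size in bytes, or -1 if the query fails.
long page_size_bytes()
{
    return sysconf(_SC_PAGESIZE);
}

// Returns how many double-precision values fit in one page, or -1 on failure.
long doubles_per_page()
{
    const long bytes = page_size_bytes();
    return bytes > 0 ? bytes / static_cast<long>(sizeof(double)) : -1;
}
```

With a 4 KB page and 8-byte doubles, doubles_per_page( ) gives 512. That is a natural block size for algorithms that try to finish their work on one page before touching the next.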


Finally, using a simple sorted data array (or STL vector) for a lookup table can sometimes be much faster 
than an “optimized” associative container, because the array might be much smaller, and is certainly more 
localized in virtual address space. See [Mey 2001, Item 23, p100-3]. 
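
As a sketch of the sorted-array approach, the lookup table below is a std::vector of (key, value) pairs, sorted once and then searched with std::lower_bound. The Entry typedef, the sample data, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

typedef std::pair<int, double> Entry;       // hypothetical (key, value) table entry

// Comparison used by std::lower_bound: is this entry's key below the target key?
inline bool key_less(const Entry & e, int key) { return e.first < key; }

// Builds a small table and sorts it by key (pairs compare on 'first' first).
std::vector<Entry> make_table()
{
    std::vector<Entry> t;
    t.push_back(Entry(30, 3.0));
    t.push_back(Entry(10, 1.0));
    t.push_back(Entry(20, 2.0));
    std::sort(t.begin(), t.end());
    return t;
}

// Binary search in contiguous memory; returns 'fallback' if the key is absent.
double lookup(const std::vector<Entry> & table, int key, double fallback)
{
    std::vector<Entry>::const_iterator it =
        std::lower_bound(table.begin(), table.end(), key, key_less);
    if (it != table.end() && it->first == key)  return it->second;
    return fallback;
}
```

All the entries sit in one contiguous block, so a search touches few cache lines and pages. A node-based container may scatter its nodes across the heap.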
Considerations on Run-Time Efficiency and Languages


In our example in C++98, it is possible to implement matrices to pass pointers to data blocks, and do 
garbage collections, etc. This would make all operations zero-copy, but is much more work. Again, C++11 
“move semantics” make this easier than before, but still add complexity, don’t always solve the problem, and 
often only partially solve it. 


Also, a minor extension to the C++ language would eliminate one of the copies. For operator methods,
C++ could allow trinary functions: operand1, operand2, destination. (This would simply be a new allowed
prototype for operator methods. I’m surprised no one has proposed this as a new C++ language feature.) I
have read that some compilers are smart enough to optimize ‘c = a + b’ into passing ‘c’ as the destination
object directly to the ‘+’ method function, thus avoiding the assignment copy. This is only possible, though,
if the compiler can guarantee that ‘c’ itself is not involved in the method function, which is not always
possible. Also, there are nontrivial construction/destruction issues involved in such an optimization, and the
compiler must ensure the integrity of all this (e.g., copy construction is different from assignment). C++17
mandates some of these optimizations, called “copy elision” (that actually makes the language
self-contradictory, because constructors are sometimes bypassed even if they have side-effects).
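
The bypassed side-effect is easy to see with a class that counts its copy constructions. This sketch assumes a C++17 (or later) compiler, where this form of elision is required. Older compilers usually perform it too, but are not obliged to. Tracker and the function names are hypothetical.

```cpp
// Hypothetical type whose copy constructor has a visible side-effect.
struct Tracker
{
    static int copies;                      // total copy constructions so far
    Tracker() {}
    Tracker(const Tracker &) { ++copies; }
};
int Tracker::copies = 0;

// Returns a temporary; under C++17 it is constructed directly in the caller's object.
Tracker make_tracker() { return Tracker(); }

// Returns how many copy constructions it takes to initialize an object
// from a function's return value.
int copies_for_return()
{
    const int before = Tracker::copies;
    Tracker t = make_tracker();             // looks like a copy, but none is made
    (void) t;
    return Tracker::copies - before;
}
```

Under C++17, copies_for_return( ) gives 0: the counting side-effect in the copy constructor never happens, even though the source code appears to copy.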


Note that Fortran by default has no copy of a function return object, and so avoids this C++ burden. 
Coders who want a copy (it happens) can easily add it themselves. 


Some scientists believe Fortran (no longer all-caps) is still faster than C++, though I found little evidence 
to support that, in general (as of 2018). Fortran’s array slicing is often easier to code and faster to run than 
C++ or Python. Fortran also has strong support for parallel processing. 


Python is an interpreted byte-code language, like Java. Your source is compiled into a byte code, which 
is interpreted as it runs. It is inherently about 20-50x slower than languages compiled into direct machine 
instructions, like C, C++, and Fortran. Python is feasible for large data computations only when the 
computations are mostly standard functions like matrix multiply, inversion, etc. Standard functions exist as 
fast Python libraries that you can call (they are usually written in C). However, not all computations fit into 
standard mathematical routines. For example, in my dissertation, I wrote C++ data reduction code for a 
physics experiment, and independently, someone else wrote Python code to do the same. The C++ code ran 
20x faster than Python. I could not have completed my dissertation, nor earned a PhD, in Python. 


Use the right hammer for the job. Don’t be a zealot.


16 Tensors, Without the Tension


Approach 


We’ll present tensors as follows:

1. Two physical examples: magnetic susceptibility, and deformable solids
2. A non-example: when is a matrix not a tensor?
3. Forward looking definitions (don’t get stuck on these)
4. Review of vector spaces and notation (don’t get stuck on this, either)
5. A short, but at first unhelpful, definition (really, really don’t get stuck on this)
6. A discussion which clarifies the above definition
7. Examples, including dot products and cross-products as tensors
8. Higher rank tensors
9. Change of basis
10. Non-orthonormal systems: contravariance and covariance
11. Indefinite metrics of Special and General Relativity
12. Mixed basis linear functions (transformation matrices, the Pauli vector)


Tensors are all about vectors. They let you do things with vectors you never thought possible. We 
define tensors in terms of what they do (their linearity properties), and then show that linearity implies the 
transformation properties. This gets most directly to the true importance of tensors. [Most references define 
tensors in terms of transformations, but then fail to point out the all-important linearity properties. ] 


We also take a geometric approach, treating vectors and tensors as geometric objects that exist 
independently of their representation in any basis. Inevitably, though, there is a fair amount of unavoidable 
algebra. 


Later, we introduce contravariance and covariance in terms of non-orthonormal coordinates, but first 
with a familiar positive-definite metric from classical mechanics. This makes for a more intuitive 
understanding of contra- and co-variance, before applying the concept to the more bizarre indefinite metrics 
of special and general relativity. 


There is deliberate repetition of several points, because it usually takes me more than once to grok 
something. So I repeat: 


Two Physical Examples 


We start with two physical examples: magnetic susceptibility, and deformation of a solid. We start with 
matrix notation, because we assume it is familiar to you. Later we will see that matrix notation is not ideal 
for tensor algebra. 


Magnetic Susceptibility 


We assume you are familiar with susceptibility of magnetic materials: when placed in an H-field,
magnetizable (susceptible) materials acquire a magnetization, which adds to the resulting B-field. In simple
cases, the susceptibility χ is a scalar, and

$$ \mathbf{M} = \chi \mathbf{H} $$

where M is the magnetization,
χ is the susceptibility, and
H is the applied magnetic field.


The susceptibility in this simple case is the same in any direction; i.e., the material is isotropic. 


However, there exist materials which are more magnetizable in some directions than others. E.g., 
imagine a cubic lattice of axially-symmetric molecules which are more magnetizable along the molecular 
axis than perpendicular to it: 


[Figure: a cubic lattice of axially-symmetric molecules in an applied field H, with axes x, y, z; more magnetizable along x, less magnetizable along y and z; χ_xx = 2, χ_yy = 1, χ_zz = 1.]


Magnetization, M, as a function of external field, H, for a material with a tensor-valued
susceptibility, χ.


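In each direction, the magnetization is proportional to the applied field, but χ is larger in the x-direction than
in y or z. In this example, for an arbitrary H-field, we have

$$ \mathbf{M} = (M_x, M_y, M_z) = (2H_x, H_y, H_z) \quad\text{or}\quad \mathbf{M} = \chi\mathbf{H} = \underbrace{\begin{pmatrix} 2 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}}_{\chi \,\equiv\, \chi_{ij}} \mathbf{H} . $$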


Note that in general, M is not parallel to H (below, dropping the z axis for now): 


[Figure: H and M = (2H_x, H_y) drawn in the x-y plane.]


M need not be parallel to H for a material with a tensor-valued χ.


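But M is a linear function of H, which means: M(kH_1 + H_2) = kM(H_1) + M(H_2).


This linearity is reflected in the fact that matrix multiplication is linear:

$$ \mathbf{M}(k\mathbf{H}_1 + \mathbf{H}_2) = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} (k\mathbf{H}_1 + \mathbf{H}_2) = k \begin{pmatrix} 2 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \mathbf{H}_1 + \begin{pmatrix} 2 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \mathbf{H}_2 = k\mathbf{M}(\mathbf{H}_1) + \mathbf{M}(\mathbf{H}_2) . $$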
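
For readers who like to check such statements numerically, here is a small sketch using the example susceptibility with diagonal (2, 1, 1). The Vec3 type and the function names are made up for illustration. The sample field values are arbitrary.

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };            // hypothetical 3-vector in the x-y-z basis

Vec3 make_vec3(double x, double y, double z)
{
    Vec3 v;  v.x = x;  v.y = y;  v.z = z;
    return v;
}

// Magnetization M = chi * H for the example susceptibility matrix.
Vec3 magnetization(const Vec3 & h)
{
    static const double chi[3][3] = { {2., 0., 0.}, {0., 1., 0.}, {0., 0., 1.} };
    return make_vec3(chi[0][0]*h.x + chi[0][1]*h.y + chi[0][2]*h.z,
                     chi[1][0]*h.x + chi[1][1]*h.y + chi[1][2]*h.z,
                     chi[2][0]*h.x + chi[2][1]*h.y + chi[2][2]*h.z);
}

// Returns k*u + v.
Vec3 combine(double k, const Vec3 & u, const Vec3 & v)
{
    return make_vec3(k*u.x + v.x, k*u.y + v.y, k*u.z + v.z);
}

// Checks M(k H1 + H2) == k M(H1) + M(H2), to within rounding error.
bool is_linear(double k, const Vec3 & h1, const Vec3 & h2)
{
    const Vec3 lhs = magnetization(combine(k, h1, h2));
    const Vec3 rhs = combine(k, magnetization(h1), magnetization(h2));
    return std::fabs(lhs.x - rhs.x) < 1e-12
        && std::fabs(lhs.y - rhs.y) < 1e-12
        && std::fabs(lhs.z - rhs.z) < 1e-12;
}

// z-component of u x v; nonzero means u and v are not parallel in the x-y plane.
double cross_z(const Vec3 & u, const Vec3 & v)
{
    return u.x * v.y - u.y * v.x;
}
```

For H = (1, 1, 0), magnetization( ) gives M = (2, 1, 0), and cross_z(H, M) = -1. So M is not parallel to H, yet is_linear( ) still holds.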


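The matrix notation might seem like overkill, since χ is diagonal, but it is only diagonal in this basis of
x, y, and z. We’ll see in a moment what happens when we change basis. First, let us understand what the
matrix χ_ij really means. Recall the visualization of pre-multiplying a vector by a matrix: a matrix χ times a
column vector H, is a weighted sum of the columns of χ:

$$ \chi\mathbf{H} = \begin{pmatrix} \chi_{xx} & \chi_{xy} & \chi_{xz} \\ \chi_{yx} & \chi_{yy} & \chi_{yz} \\ \chi_{zx} & \chi_{zy} & \chi_{zz} \end{pmatrix} \begin{pmatrix} H_x \\ H_y \\ H_z \end{pmatrix} = H_x \begin{pmatrix} \chi_{xx} \\ \chi_{yx} \\ \chi_{zx} \end{pmatrix} + H_y \begin{pmatrix} \chi_{xy} \\ \chi_{yy} \\ \chi_{zy} \end{pmatrix} + H_z \begin{pmatrix} \chi_{xz} \\ \chi_{yz} \\ \chi_{zz} \end{pmatrix} $$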


We can think of the matrix χ as a set of 3 column vectors: the first is the magnetization vector for H = e_x; the
2nd column is M for H = e_y; the 3rd column is M for H = e_z. Since magnetization is linear in H, the
magnetization for any H can be written as the weighted sum of the magnetizations for each of the basis
vectors:

$$ \mathbf{M}(\mathbf{H}) = H_x \mathbf{M}(\mathbf{e}_x) + H_y \mathbf{M}(\mathbf{e}_y) + H_z \mathbf{M}(\mathbf{e}_z) \quad\text{where}\quad \mathbf{e}_x, \mathbf{e}_y, \mathbf{e}_z \text{ are the unit vectors in } x, y, z . $$

This is just the matrix multiplication above: M = χH. (We’re writing all indexes as subscripts for now; later
on we’ll see that M, χ, and H should be indexed as M^i, χ^i_j, and H^j.)


Now let’s change bases from e_x, e_y, e_z, to some e_1, e_2, e_3, defined below. We use a simple transformation,
but the 1-2-3 basis is not orthonormal:


[Figure: the old basis and the new basis.]


Transformation to a non-orthogonal, non-normal basis. e_1 and e_2 are in the x-y plane, but are neither
orthogonal nor normal. For simplicity, we choose e_3 = e_z. Here, b and c are negative.


To find the transformation equations to the new basis, we first write the old basis vectors in the new 
basis. We’ve chosen for simplicity a transformation in the x-y plane, with the z-axis unchanged: 


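$$ \mathbf{e}_x = a\,\mathbf{e}_1 + b\,\mathbf{e}_2, \qquad \mathbf{e}_y = c\,\mathbf{e}_1 + d\,\mathbf{e}_2, \qquad \mathbf{e}_z = \mathbf{e}_3 . $$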


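Now write a vector, v, in the old basis, and substitute out the old basis vectors for the new basis. We see that
the new components are a linear combination of the old components:

$$ \mathbf{v} = v_x \mathbf{e}_x + v_y \mathbf{e}_y + v_z \mathbf{e}_z = v_x \underbrace{(a\,\mathbf{e}_1 + b\,\mathbf{e}_2)}_{\mathbf{e}_x} + v_y \underbrace{(c\,\mathbf{e}_1 + d\,\mathbf{e}_2)}_{\mathbf{e}_y} + v_z \mathbf{e}_3 $$

$$ = (a v_x + c v_y)\,\mathbf{e}_1 + (b v_x + d v_y)\,\mathbf{e}_2 + v_z \mathbf{e}_3 \equiv v_1 \mathbf{e}_1 + v_2 \mathbf{e}_2 + v_3 \mathbf{e}_3 $$

$$ \Rightarrow \quad v_1 = a v_x + c v_y, \qquad v_2 = b v_x + d v_y, \qquad v_3 = v_z $$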


Recall that matrix multiplication is defined to be the operation of linear transformation, so we can write this 
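basis transformation in matrix form:

$$ \begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix} = \begin{pmatrix} a & c & 0 \\ b & d & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} v_x \\ v_y \\ v_z \end{pmatrix} = v_x \underbrace{\begin{pmatrix} a \\ b \\ 0 \end{pmatrix}}_{\mathbf{e}_x} + v_y \underbrace{\begin{pmatrix} c \\ d \\ 0 \end{pmatrix}}_{\mathbf{e}_y} + v_z \underbrace{\begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}}_{\mathbf{e}_z} $$


The columns of the transformation matrix are the old basis vectors written in the new basis.


This is illustrated explicitly on the right hand side, which is just v_x e_x + v_y e_y + v_z e_z.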
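
The component transformation can be sketched in a few lines of code. The Components type and the function names are hypothetical, and the numerical values of a, b, c, d in the usage note are arbitrary choices (with b and c negative, as in the figure).

```cpp
struct Components { double c[3]; };         // hypothetical triple of vector components

Components make_components(double c0, double c1, double c2)
{
    Components v;  v.c[0] = c0;  v.c[1] = c1;  v.c[2] = c2;
    return v;
}

// Converts components (v_x, v_y, v_z) in the old basis to (v_1, v_2, v_3) in the
// new basis, given e_x = a e_1 + b e_2,  e_y = c e_1 + d e_2,  e_z = e_3.
Components to_new_basis(const Components & v, double a, double b, double c, double d)
{
    return make_components(a * v.c[0] + c * v.c[1],     // v_1 = a v_x + c v_y
                           b * v.c[0] + d * v.c[1],     // v_2 = b v_x + d v_y
                           v.c[2]);                     // v_3 = v_z
}
```

With a = 1.5, b = -0.5, c = -0.25, d = 1.25, the old basis vector e_x = (1, 0, 0) transforms to (1.5, -0.5, 0), which is the first column of the transformation matrix. The vector (2, 4, 7) transforms to (2, 4, 7) as well, which is a coincidence of these particular numbers.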


Finally, we look at how the susceptibility matrix χ_ij transforms to the new basis. We saw above that the
columns of χ are the M vectors for H = each of the basis vectors. So right away, we must transform each
column of χ with the transformation matrix above, to convert it to the new basis. Since matrix multiplication
A·B is distributive across the columns of B, we can write the transformation of all 3 columns in a single
expression by pre-multiplying with the above transformation matrix:

$$ \text{Step 1 of } \chi^{new} = \chi \text{ in new basis} = \begin{pmatrix} a & c & 0 \\ b & d & 0 \\ 0 & 0 & 1 \end{pmatrix} \chi = \begin{pmatrix} a & c & 0 \\ b & d & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 2 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 2a & c & 0 \\ 2b & d & 0 \\ 0 & 0 & 1 \end{pmatrix} $$


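But we’re not done. This first step expressed the column vectors in the new basis, but the columns of
the RHS (right hand side) are still the M’s for basis vectors e_x, e_y, e_z. Instead, we need the columns of χ^new to
be the M vectors for e_1, e_2, e_3. Please don’t get bogged down yet in the details, but we do this transformation
similarly to how we transformed the column vectors. We transform the contributions to M due to e_x, e_y, e_z
to that due to e_1 by writing e_1 in terms of e_x, e_y, e_z:

$$ \mathbf{e}_1 = e\,\mathbf{e}_x + f\,\mathbf{e}_y \quad\Rightarrow\quad \mathbf{M}(\mathbf{H} = \mathbf{e}_1) = e\,\mathbf{M}(\mathbf{H} = \mathbf{e}_x) + f\,\mathbf{M}(\mathbf{H} = \mathbf{e}_y) . $$

Similarly,

$$ \mathbf{e}_2 = g\,\mathbf{e}_x + h\,\mathbf{e}_y \quad\Rightarrow\quad \mathbf{M}(\mathbf{H} = \mathbf{e}_2) = g\,\mathbf{M}(\mathbf{H} = \mathbf{e}_x) + h\,\mathbf{M}(\mathbf{H} = \mathbf{e}_y) $$

$$ \mathbf{e}_3 = \mathbf{e}_z \quad\Rightarrow\quad \mathbf{M}(\mathbf{H} = \mathbf{e}_3) = \mathbf{M}(\mathbf{H} = \mathbf{e}_z) $$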


Essentially, we need to transform among the columns, i.e. transform the rows of χ. These two transformations
(once of the columns, and once of the rows) are the essence of a rank-2 tensor:


A tensor matrix (rank-2 tensor) has columns that are vectors, and simultaneously, its rows are also
vectors. Therefore, transforming to a new basis requires two transformations:
once for the rows, and once for the columns (in either order).


[Aside: The details (which you can skip at first): We just showed that we transform using the inverse of our 
previous transformation. The reason for the inverse is related to the up/down indexes mentioned earlier; please be 
patient. In matrix notation, we write the row transformation as post-multiplying by the transpose of the needed 
transformation: 


T 


[Another aside: A direction-dependent susceptibility requires χ to be promoted from a scalar to a rank-2 tensor
(skipping any rank-1 tensor). This is necessary because a rank-0 tensor (a scalar) and a rank-2 tensor can both act 
on a vector (H) to produce a vector (M). There is no sense to a rank-1 (vector) susceptibility, because there is no 
simple way a rank-1 tensor (a vector) can act on another vector H to produce an output vector M. More on this 
later. ]


Mechanical Strain


A second example of a tensor is the mechanical strain tensor. When I push on a deformable material, it
deforms. A simple model is just a spring, with Hooke’s law:

$$ \Delta x = +\frac{1}{k} F_{applied} $$

We write the formula with a plus sign, because (unlike freshman physics spring questions) we are interested
in how a body deforms when we apply a force to it. For an isotropic material, we can push in any direction,
and the deformation is parallel to the force. This makes the above equation a vector equation:

$$ \Delta\mathbf{x} = s\mathbf{F} \quad\text{where}\quad s \equiv \frac{1}{k} \equiv \text{the strain constant} . $$


Strain is defined as the displacement of a given point under force. [Stress is the force per unit area
applied to a body. Stress produces strain.] In an isotropic material, the strain constant is a simple scalar.
Note that if we transform to another basis for our vectors, the strain constant is unchanged. That’s the
definition of a scalar: 


A scalar is a number that is the same in any coordinate system. A scalar is a rank-0 tensor. 


The scalar is unchanged even in a non-ortho-normal coordinate system. 


But what if our material is a bunch of microscopic blobs connected by stiff rods, like atoms in a crystal? 



(Left) A constrained deformation crystal structure. (Middle) The deformation vector, Ax, is not 
parallel to the force. (Right) More extreme geometries lead to a larger angle between the force and 
displacement. 


The diagram shows a 2D example: pushing in the x-direction results in both x and y displacements. The
same principle could result in a 3D Δx, with some component into the page. For small deformations, the
deformation is linear with the force: pushing twice as hard results in twice the displacement. Pushing with
the sum of two (not necessarily parallel) forces results in the sum of the individual displacements. But the
displacement is not proportional to the force (because the displacement is not parallel to it). In fact, each
component of force results in a deformation vector. Mathematically:

$$\Delta\mathbf{x} = F_x\begin{pmatrix} s_{xx}\\ s_{yx}\\ s_{zx}\end{pmatrix} + F_y\begin{pmatrix} s_{xy}\\ s_{yy}\\ s_{zy}\end{pmatrix} + F_z\begin{pmatrix} s_{xz}\\ s_{yz}\\ s_{zz}\end{pmatrix} = \begin{pmatrix} s_{xx} & s_{xy} & s_{xz}\\ s_{yx} & s_{yy} & s_{yz}\\ s_{zx} & s_{zy} & s_{zz}\end{pmatrix}\begin{pmatrix} F_x\\ F_y\\ F_z\end{pmatrix} = \mathbf{s}\,\mathbf{F}$$


Much like the anisotropy of the magnetization in the previous example, the anisotropy of the strain requires
us to use a rank-2 tensor to describe it. The linearity of the strain with force allows us to write the strain
tensor as a matrix. Linearity also guarantees that we can change to another basis using a method similar to
that shown above for the susceptibility tensor. Specifically, we must transform both the columns and the
rows of the strain tensor s. Furthermore, the linearity of deformation with force also ensures that we can use
non-orthonormal bases, just as well as orthonormal ones.
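
To make the anisotropic response concrete, here is a minimal Python sketch. The 2D tensor components in `s` and the helper name `apply` are made up for illustration (they are not taken from any real material); the point is only that a push along x produces a displacement with a y part, and that the response is linear in the force.

```python
# Hypothetical 2D strain tensor s (illustrative numbers only, not from the text).
s = [[0.50, 0.20],
     [0.20, 0.10]]

def apply(tensor, force):
    """Return the displacement s F: each force component weights one column of s."""
    n = len(force)
    return [sum(tensor[i][j] * force[j] for j in range(n)) for i in range(n)]

push_x = [1.0, 0.0]            # push purely in the x-direction
dx = apply(s, push_x)          # the displacement has both x and y parts
print(dx)                      # [0.5, 0.2]

# Linearity: the response to a sum of forces is the sum of the responses.
f1, f2 = [1.0, 2.0], [-3.0, 0.5]
total = apply(s, [f1[0] + f2[0], f1[1] + f2[1]])
summed = [u + v for u, v in zip(apply(s, f1), apply(s, f2))]
print(total, summed)           # equal up to floating-point rounding
```

The first column of `s` is exactly the displacement produced by a unit push along x, which is the "each component of force results in a deformation vector" statement above.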

When Is a Matrix Not a Tensor?


I would say that most matrices are not tensors. A matrix is a tensor when its rows and columns are both 
vectors. This implies that there is a vector space, basis vectors, and the possibility of changing basis. As a 
counter example, consider the following graduate physics problem: 


Two pencils, an eraser, and a ruler cost $2.20. Four pencils, two erasers, and a ruler cost $3.45. Four 
pencils, an eraser, and two rulers cost $3.85. How much does each item cost? 


We can write this as simultaneous equations, and as shorthand in matrix notation: 
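
$$\begin{aligned} 2p + e + r &= 220\\ 4p + 2e + r &= 345\\ 4p + e + 2r &= 385 \end{aligned} \qquad\text{or}\qquad \begin{pmatrix} 2 & 1 & 1\\ 4 & 2 & 1\\ 4 & 1 & 2 \end{pmatrix}\begin{pmatrix} p\\ e\\ r \end{pmatrix} = \begin{pmatrix} 220\\ 345\\ 385 \end{pmatrix}$$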


It is possible to use a matrix for this problem because the problem takes linear combinations of the costs of 
3 items. Matrix multiplication is defined as the process of linear combinations, which is the same process as 
linear transformations. However, the above matrix is not a tensor, because there are no vectors of school 
supplies, no bases, and no linear combinations of (say) part eraser and part pencil. Therefore, the matrix has 
no well-defined transformation properties. Hence, it is a lowly matrix, but no tensor. 
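
For completeness, the lowly matrix still does its job of solving the problem. The following is a small sketch using exact rational arithmetic; the function name `solve` and the choice of Gauss-Jordan elimination are ours, not the text's, and the sketch assumes the system has a unique solution.

```python
from fractions import Fraction

def solve(matrix, rhs):
    """Gauss-Jordan elimination with exact rational arithmetic (unique solution assumed)."""
    n = len(rhs)
    rows = [[Fraction(x) for x in row] + [Fraction(c)] for row, c in zip(matrix, rhs)]
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [x / rows[col][col] for x in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

costs = [[2, 1, 1], [4, 2, 1], [4, 1, 2]]   # pencils, erasers, rulers per purchase
totals = [220, 345, 385]                    # prices in cents
pencil, eraser, ruler = solve(costs, totals)
print(pencil, eraser, ruler)                # 35 55 95
```

So a pencil costs $0.35, an eraser $0.55, and a ruler $0.95. Nothing in this calculation involves a basis or a change of basis, which is why the matrix is not a tensor.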


However, later (in “We Don’t Need No Stinking Metric”) we’ll see that under the right conditions, we 
can form a vector space out of seemingly unrelated quantities. 

Heading In the Right Direction

An ordinary vector associates a number with each direction of space:

$$\mathbf{v} = v_x\hat{\mathbf{x}} + v_y\hat{\mathbf{y}} + v_z\hat{\mathbf{z}}.$$

The vector v associates the number $v_x$ with the x-direction; it associates the number $v_y$ with the y-direction,
and the number $v_z$ with the z-direction.


The above tensor examples illustrate the basic nature of a rank-2 tensor: 


A rank-2 tensor associates a vector with each direction of space: 


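$$\mathbf{T} = \begin{pmatrix} T_{xx}\\ T_{yx}\\ T_{zx}\end{pmatrix}\hat{\mathbf{x}} + \begin{pmatrix} T_{xy}\\ T_{yy}\\ T_{zy}\end{pmatrix}\hat{\mathbf{y}} + \begin{pmatrix} T_{xz}\\ T_{yz}\\ T_{zz}\end{pmatrix}\hat{\mathbf{z}}$$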


Some Definitions and Review

These definitions will make more sense as we go along. Don’t get stuck on these:

“ordinary” vector = contravariant vector = contravector = $(^1_0)$ tensor
1-form = covariant vector = covector = $(^0_1)$ tensor. (Yes, there are 4 different ways to say the same thing.)


covariant the same. E.g., General Relativity says that the mathematical form of the laws of physics 
are covariant (i.e., the same) with respect to arbitrary coordinate transformations. 
This is a completely different meaning of “covariant” than the one above. 


rank The number of indexes of a tensor; $T^{ij}$ is a rank-2 tensor; $R^i{}_{jkl}$ is a rank-4 tensor. Rank is
unrelated to the dimension of the vector space in which the tensor operates. 


MVE mathematical vector element. Think of it as a vector for now. 

Caution: a rank $(^0_1)$ tensor is a 1-form, but a rank $(^0_2)$ tensor is not always a 2-form. [Don’t worry about
it, but just for completeness, a 2-form (or any n-form) has to be fully anti-symmetric in all pairs of vector
arguments.]


Notation: 
(a, b, c) is a row vector; $(a, b, c)^T$ is a column vector (the transpose of a row vector).


To satisfy our pathetic word processor, we write $(^m_n)$, even though the ‘m’ is supposed to be directly
above the ‘n’.


T is a tensor, without reference to any basis or representation.

$T^{ij}$ is the matrix of components of T, contravariant in both indexes, with an understood basis.

T(v, w) is the result of T acting on v and w.

$\mathbf{v}$ or $\vec{v}$ are two equivalent ways to denote a vector, without reference to any basis or representation.
Note that a vector is a rank-1 tensor.

$\tilde{\mathbf{a}}$ or $\tilde{a}$ are two equivalent ways to denote a covariant vector (aka 1-form), without reference to
any basis or representation.

$a_i$ the components of the covector (1-form) a, in an understood basis.


Vector Space Summary 


Briefly, a vector space comprises a field of scalars, a group of vectors, and the operation of scalar 
multiplication of vectors (details below). Quantum mechanical vector spaces have two additional 
characteristics: they define a dot product between two vectors, and they define linear operators which act on 
vectors to produce other vectors. 


Before understanding tensors, it is very helpful, if not downright necessary, to understand vector spaces. 
Quirky Quantum Concepts has a more complete description of vector spaces. Here is a very brief summary: 
a vector space comprises a field of scalars, a group of vectors, and the operation of scalar multiplication of 
vectors. The scalars can be any mathematical “field,” but are usually the real numbers, or the complex 
numbers (e.g., quantum mechanics). For a given vector space, the vectors are a class of things, which can be 
one of many possibilities (physical vectors, matrices, kets, bras, tensors, ...). In particular, the vectors are not 
necessarily lists of scalars, nor need they have anything to do with physical space. Vector spaces have the 
following properties, which allow solving simultaneous linear equations both for unknown scalars, and 
unknown vectors: 


Scalars                                      Mathematical Vectors
-------------------------------------------  -------------------------------------------
Scalars form a commutative group             Vectors form a commutative group
(closure, unique identity, inverses)         (closure, unique identity, inverses)
under operation +.                           under operation +.

Scalars, excluding 0, form a commutative
group under operation (·).

Distributive property of (·) over +.

Both columns together:

Scalar multiplication of a vector produces another vector.

Distributive property of scalar multiplication over both scalar + and vector +.


With just the scalars, you can solve ordinary scalar linear equations such as: 

$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n &= c_1\\ a_{21}x_1 + a_{22}x_2 + \dots + a_{2n}x_n &= c_2\\ &\;\;\vdots\\ a_{n1}x_1 + a_{n2}x_2 + \dots + a_{nn}x_n &= c_n \end{aligned} \qquad\text{written in matrix form as}\qquad \mathbf{a}\mathbf{x} = \mathbf{c}.$$

All the usual methods of linear algebra work to solve the above equations: Cramer’s rule, Gaussian 
elimination, etc. With the whole vector space, you can solve simultaneous linear vector equations for 
unknown vectors, such as 


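$$\begin{aligned} a_{11}\mathbf{v}_1 + a_{12}\mathbf{v}_2 + \dots + a_{1n}\mathbf{v}_n &= \mathbf{w}_1\\ a_{21}\mathbf{v}_1 + a_{22}\mathbf{v}_2 + \dots + a_{2n}\mathbf{v}_n &= \mathbf{w}_2\\ &\;\;\vdots\\ a_{n1}\mathbf{v}_1 + a_{n2}\mathbf{v}_2 + \dots + a_{nn}\mathbf{v}_n &= \mathbf{w}_n \end{aligned} \qquad\text{written in matrix form as}\qquad \mathbf{a}\mathbf{v} = \mathbf{w},$$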


where a is again a matrix of scalars. The same methods of linear algebra work just as well to solve vector 
equations as scalar equations. 


Vector spaces may also have these properties: 


The key points of mathematical vectors are (1) we can form linear combinations of them to make 
other vectors, and (2) any vector can be written as a linear combination of basis vectors: 


$$\mathbf{v} = (v^1, v^2, v^3) = v^1\mathbf{e}_1 + v^2\mathbf{e}_2 + v^3\mathbf{e}_3,$$

where $\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3$ are basis vectors, and
$v^1, v^2, v^3$ are the components of v in the $\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3$ basis.


Note that $v^1, v^2, v^3$ are numbers, while $\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3$ are vectors. There is a (kind of bogus) reason why basis
vectors are written with subscripts, and vector components with superscripts, but we’ll get to that later.


The dimension of a vector space, N, is the number of basis vectors needed to construct every vector in 
the space. 


Do not confuse the dimension of physical space (typically 1D, 2D, 3D, or (in relativity) 4D),
with the dimension of the mathematical objects used to work a problem.


For example, a 3x3 matrix is an element of the vector space of 3x3 matrices. This is a 9-D vector space, 
because there are 9 basis matrices needed to construct an arbitrary matrix. 


Given a basis, components are equivalent to the vector. Components alone (without a basis) are 
insufficient to be a vector. 


[Aside: Note that for position vectors defined by r = (r, θ, φ), r, θ, and φ are not the components of
a vector. The tip off is that with two vectors, you can always add their components to get another vector.
Clearly, $\mathbf{r}_1 + \mathbf{r}_2 \neq (r_1 + r_2,\ \theta_1 + \theta_2,\ \phi_1 + \phi_2)$, so (r, θ, φ) cannot be the components of a vector. This failure
to add is due to r being a displacement vector from the origin, where there is no consistent basis: e.g.,
what is $\mathbf{e}_r$ at the origin? At points off the origin, there is a consistent basis: $\mathbf{e}_r$, $\mathbf{e}_\theta$, and $\mathbf{e}_\phi$ are well-defined.]
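
A quick numerical check of the aside, as a sketch: we assume the usual physics convention (θ is the polar angle from the z-axis, φ the azimuth), and the helper name `to_cartesian` is ours. Adding two position vectors properly (in Cartesian components) gives a very different result from adding their (r, θ, φ) entries.

```python
import math

def to_cartesian(r, theta, phi):
    """Spherical (r, polar angle theta, azimuth phi) to Cartesian (x, y, z)."""
    return (r * math.sin(theta) * math.cos(phi),
            r * math.sin(theta) * math.sin(phi),
            r * math.cos(theta))

p1 = (1.0, math.pi / 2, 0.0)            # unit vector along +x
p2 = (1.0, math.pi / 2, math.pi / 2)    # unit vector along +y

# Correct sum: add the Cartesian components.
true_sum = tuple(a + b for a, b in zip(to_cartesian(*p1), to_cartesian(*p2)))

# Naive "sum": add (r, theta, phi) entry by entry, then convert.
naive = to_cartesian(*(a + b for a, b in zip(p1, p2)))

print(true_sum)   # close to (1, 1, 0)
print(naive)      # close to (0, 0, -2): a different vector entirely
```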


When Vectors Collide 


There now arises a collision of terminology: to a physicist, “vector” usually means a physical vector in 
3- or 4-space, but to a mathematician, “vector” means an element of a mathematical vector-space. These are 
two different meanings, but they share a common aspect: linearity (i.e., we can form linear combinations of 
vectors to make other vectors, and any vector can be written as a linear combination of basis vectors). 
Because of that linearity, we can have general rank-n tensors whose components are arbitrary elements of a 
mathematical vector-space. To make the terminology confusion worse, an $(^m_n)$ tensor whose components are
simple numbers is itself a “vector-element” of the vector-space of $(^m_n)$ tensors.


Mathematical vector-elements of a vector space are much more general than physical vectors (e.g. force, 
or velocity), though physical vectors and tensors are elements of mathematical vector spaces. To be clear, 
we’ll use MVE to refer to a mathematical vector-element of a vector space, and “vector” to mean a normal 
physics vector (3-vector or 4-vector). Recall that MVEs are usually written as a set of components in some 
basis, just like vectors are. In the beginning, we choose all the input MVEs to be vectors. 


If you’re unclear about what an MVE is, just think of it as a physical vector for now, like “force.” 


“Tensors” vs. “Symbols” 


There are lots of tensors: metric tensors, electromagnetic tensors, Riemann tensors, etc. There are also 
“symbols:” Levi-Civita symbols, Christoffel symbols, etc. What’s the difference? “Symbols” aren’t tensors. 
Symbols look like tensors, in that they have components indexed by multiple indices, they are referred to 
basis vectors, and are summed with tensors. But they are defined to have specific components, which may 
depend on the basis, and therefore symbols don’t change basis (transform) the way tensors do. Hence, 
symbols are not geometric entities, with a meaning in a manifold, independent of coordinates. For example, 
the Levi-Civita symbol is defined to have specific constant components in all bases. It doesn’t follow the 
usual change-of-basis rules. Therefore, it cannot be a tensor. 


Notational Nightmare 


If you come from a differential geometry background, you may wonder about some insanely confusing 
notation. It is a fact that “$d\mathbf{x}$” and “$\mathbf{d}x$” are two different things:


$d\mathbf{x} = (dx, dy, dz)$ is a vector, but
$\mathbf{d}x = \nabla x(\mathbf{r})$ is a 1-form


We don’t use the second notation (or exterior derivatives) in this chapter, but we might in the Differential 
Geometry chapter. 


Tensors? What Good Are They? 


A Short, Complicated Definition 


It is very difficult to give a short definition of a tensor that is useful to anyone who doesn’t already know 
what a tensor is. Nonetheless, you’ve got to start somewhere, so we’ll give a short definition, to point in the 
right direction, but it may not make complete sense at first (don’t get hung up on this, skip if needed): 


A tensor is an operator on one or more mathematical vector elements (MVEs), linear in each operand, 
which produces another mathematical vector element. 


The key point is this (which we describe in more detail in a moment): 


Linearity in all the operands is the essence of a tensor. 


I should add that the basis vectors for all the MVEs must be the same (or tensor products of the same) 
for an operator to qualify as a tensor. But that’s too much to put in a “short” definition. We clarify this point 
later. 


Note that a scalar (i.e., a coordinate-system-invariant number, but for now, just a number) satisfies the 
definition of a “mathematical vector element.” 


Many definitions of tensors dwell on the transformation properties of tensors. This is mathematically 
valid, but such definitions give no insight into the use of tensors, or why we like them. Note that to satisfy 
the transformation properties, all the input vectors and output tensors must be expressed in the same basis (or 
tensor products of that basis with itself). 

Some coordinate systems require distinguishing between contravariant and covariant components of
tensors; superscripts denote contravariant components; subscripts denote covariant components. However, 
orthonormal positive definite systems, such as the familiar Cartesian, spherical, and cylindrical systems, do 
not require such a distinction. So for now, let’s ignore the distinction, even though the following notation 
properly represents both contravariant and covariant components. Thus, in the following text, contravariant 
components are written with superscripts, and covariant components are written with subscripts, but we don’t 
care right now. Just think of them all as components in an arbitrary coordinate system. 


Building a Tensor 


Oversimplified, a tensor operates on vectors to produce a scalar or a vector. Let’s construct a tensor 
which accepts (operates on) two 3-vectors to produce a scalar. (We’ll see later that this is a rank-2 tensor.) 
Let the tensor T act on vectors a and b to produce a scalar, s; in other words, this tensor is a scalar function 
of two vectors: 


s=T(a,b). 


Call the first vector a = $(a^1, a^2, a^3)$ in some basis, and the second vector b = $(b^1, b^2, b^3)$ (in the same basis).
A tensor, by definition, must be linear in both a and b; if we double a, we double the result, if we triple b,
we triple the result, etc. Also,

$$T(\mathbf{a}+\mathbf{c}, \mathbf{b}) = T(\mathbf{a},\mathbf{b}) + T(\mathbf{c},\mathbf{b}), \qquad\text{and}\qquad T(\mathbf{a}, \mathbf{b}+\mathbf{d}) = T(\mathbf{a},\mathbf{b}) + T(\mathbf{a},\mathbf{d}).$$

So the result must involve at least the product of a component of a with a component of b. Let’s say the
tensor takes $a^2b^1$ as that product, and additionally multiplies it by a constant, $T_{21}$. Then we have built a tensor
acting on a and b, and it is linear in both:

$$T(\mathbf{a},\mathbf{b}) = T_{21}\,a^2b^1. \qquad\text{Example: } T(\mathbf{a},\mathbf{b}) = 7a^2b^1.$$


But, if we add to this some other weighted product of some other pair of components, the result is still a 
tensor: it is still linear in both a and b: 


$$T(\mathbf{a},\mathbf{b}) = T_{13}\,a^1b^3 + T_{21}\,a^2b^1. \qquad\text{Example: } T(\mathbf{a},\mathbf{b}) = 4a^1b^3 + 7a^2b^1.$$


In fact, we can extend this to the weighted sum of all combinations of components, one each from a and b. 
Such a sum is still linear in both a and b: 


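$$T(\mathbf{a},\mathbf{b}) = \sum_{i=1}^{3}\sum_{j=1}^{3} T_{ij}\,a^ib^j. \qquad\text{Example: } T_{ij} = \begin{pmatrix} 2 & 6 & 4\\ 7 & 5 & -1\\ 6 & 0 & 8 \end{pmatrix}$$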


Further, nothing else can be added to this that is linear in a and b. 


A tensor is the most general linear function of a and b that exists,
i.e. any linear function of a and b can be written as a 3x3 matrix.


(We’ll see that the rank of a tensor is equal to the number of its indices; T is a rank-2 tensor.) The $T_{ij}$ are
the components of the tensor (in the basis of the vectors a and b). At this point, we consider the components
of T, a, and b all as just numbers.
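
As a sketch of the construction above, the following Python evaluates the double sum using the example components $T_{ij}$ from the text, and checks linearity in each slot. The test vectors `a`, `b`, `c` and the function name `act` are arbitrary choices for illustration.

```python
T = [[2, 6, 4],
     [7, 5, -1],
     [6, 0, 8]]          # the example components T_ij from the text

def act(T, a, b):
    """Return T(a, b) = sum over i, j of T_ij a^i b^j."""
    return sum(T[i][j] * a[i] * b[j] for i in range(3) for j in range(3))

a, b, c = [1, 2, 3], [4, 0, -1], [0, 5, 2]   # arbitrary test vectors

s = act(T, a, b)
print(s)                                      # 110

# Linear in the first slot: T(a + c, b) = T(a, b) + T(c, b).
a_plus_c = [x + y for x, y in zip(a, c)]
print(act(T, a_plus_c, b) == act(T, a, b) + act(T, c, b))   # True

# Doubling a doubles the result; tripling b triples it.
print(act(T, [2 * x for x in a], b) == 2 * s)               # True
print(act(T, a, [3 * x for x in b]) == 3 * s)               # True
```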


Why does a tensor have a separate weight for each combination of components, one from each input 
mathematical vector element (MVE)? Couldn’t we just weight each input MVE as a whole? No, because 
that would restrict tensors to only some linear functions of the inputs. 


Any linear function of the input vectors can be represented as a tensor. 


Note that tensors, just like vectors, can be written as components in some basis. And just like vectors, 
we can transform the components from one basis to another. Such a transformation does not change the 
tensor itself (nor does it change a vector); it simply changes how we represent the tensor (or vector). More 
on transformations later. 

Tensors don’t have to produce scalar results!


Some tensors accept one or more vectors, and produce a vector for a result. Or they produce some rank-r
tensor for a result. In general, a rank-n tensor accepts ‘m’ vectors as inputs, and produces a rank ‘n − m’
tensor as a result. Since any tensor is an element of a mathematical vector space, tensors can be written as 
linear combinations of other (same rank & type) tensors. So even when a tensor produces another (lower 
rank) tensor as an output, the tensor is still a linear function of all its input vectors. It’s just a tensor-valued 
function, instead of a scalar-valued function. For example, the force on a charge: a B-field operates on a 
vector, qv, to produce a vector, f. Thus, we can think of the B-field as a rank-2 tensor which acts on a vector 
to produce a vector; it’s a vector-valued function of one vector. 


Also, in general, tensors aren’t limited to taking just vectors as inputs. Some tensors take rank-2 tensors 
as inputs. For example, the quadrupole moment tensor operates on the 2nd derivative matrix of the potential
(the rank-2 “Hessian” tensor) to produce the (scalar) work stored in the quadrupole of charges. And a density 
matrix in quantum mechanics is a rank-2 tensor that acts on an operator matrix (rank-2 tensor) to produce the 
ensemble average of that operator. 


Tensors in Action 


Let’s consider how rank-0, rank-1, and rank-2 tensors operate on a single vector. Recall that in “tensor- 
talk,” a scalar is an invariant number, i.e. it is the same number in any coordinate system. 


Rank-0: A rank-0 tensor is a scalar, i.e. a coordinate-system-independent number. Multiplying a vector
by a rank-0 tensor (a scalar), produces a new vector. Each component of the vector contributes to the
corresponding component of the result, and each component is weighted equally by the scalar, a:

$$\mathbf{v} = v^x\mathbf{i} + v^y\mathbf{j} + v^z\mathbf{k} \quad\Rightarrow\quad a\mathbf{v} = av^x\mathbf{i} + av^y\mathbf{j} + av^z\mathbf{k}.$$


Rank-1: A rank-1 tensor a operates on (contracts with) a vector to produce a scalar. Each component 
of the input vector contributes a number to the result, but each component is weighted separately by the 
corresponding component of the tensor a: 


$$\tilde{a}(\mathbf{v}) = a_xv^x + a_yv^y + a_zv^z = \sum_{i=1}^{3} a_iv^i.$$

Note that a vector is itself a rank-1 tensor. Above, instead of considering a acting on v, we can equivalently 
consider that v acts on a: a(v) = v(a). Both a and v are of equal standing. 


Rank-2: Filling one slot of a rank-2 tensor with a vector produces a new vector. Each component of 
the input vector contributes a vector to the result, and each input vector component weights a different vector. 


[Figure: three panels, (a), (b), and (c), with the tensor's columns labeled column 1, column 2, and column 3.]

(a) A hypothetical rank-2 tensor with an x-vector (red), a y-vector (green), and a z-vector (blue). (b)
The tensor acting on the vector (1, 1, 1) producing a vector (heavy black). Each component (column)
vector of the tensor is weighted by 1, and summed. (c) The tensor acting on the vector (0, 2, 0.5),
producing a vector (heavy black). The x-vector is weighted by 0, and so does not contribute; the y-vector
is weighted by 2, so contributes double; the z-vector is weighted by 0.5, so contributes half.


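$$\mathbf{B}(\_, \mathbf{v}) = B^i{}_j v^j = v^x\begin{pmatrix} B^x{}_x\\ B^y{}_x\\ B^z{}_x\end{pmatrix} + v^y\begin{pmatrix} B^x{}_y\\ B^y{}_y\\ B^z{}_y\end{pmatrix} + v^z\begin{pmatrix} B^x{}_z\\ B^y{}_z\\ B^z{}_z\end{pmatrix} = \sum_{j=x,y,z}\left( B^x{}_j v^j\,\mathbf{i} + B^y{}_j v^j\,\mathbf{j} + B^z{}_j v^j\,\mathbf{k}\right)$$


The columns of B are the vectors which are weighted by each of the input vector components, $v^j$; or
equivalently, the columns of B are the vector weights for each of the input vector components.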
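
The "weighted sum of columns" picture can be checked directly. In this sketch the components of `B` are invented for illustration, and the function names are ours; the input vector (0, 2, 0.5) is the one used in panel (c) of the figure caption above.

```python
B = [[1, 0, 2],
     [0, 3, 1],
     [4, 1, 0]]   # hypothetical components B^i_j; row index i, column index j

def column(B, j):
    """Return column j of B as a list (the vector weighted by v^j)."""
    return [B[i][j] for i in range(3)]

def act_by_columns(B, v):
    """Sum the columns of B, each weighted by the matching component of v."""
    result = [0, 0, 0]
    for j in range(3):
        result = [r + v[j] * c for r, c in zip(result, column(B, j))]
    return result

def act_by_rows(B, v):
    """Ordinary matrix-times-vector: (B v)^i = sum over j of B^i_j v^j."""
    return [sum(B[i][j] * v[j] for j in range(3)) for i in range(3)]

v = [0, 2, 0.5]
print(act_by_columns(B, v))   # [1.0, 6.5, 2.0]
print(act_by_rows(B, v))      # [1.0, 6.5, 2.0]
```

The first column contributes nothing (weight 0), the second contributes double, and the third contributes half, exactly as the caption describes.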


Example of a simple rank-2 tensor: the moment-of-inertia tensor, I. Every blob of matter has one.
We know from mechanics that if you rotate an arbitrary blob around an arbitrary axis, the angular momentum 
vector of the blob does not in general line up with the axis of rotation. So what is the angular momentum 
vector of the blob? It is a vector-valued linear function of the angular velocity vector, i.e. given the angular 
velocity vector, you can operate on it with the moment-of-inertia tensor, to get the angular momentum vector. 
Therefore, by the definition of a tensor as a linear operation on a vector, the relationship between angular 
momentum vector and angular velocity vector can be given as a tensor; it is the moment-of-inertia tensor. It 
takes as an input the angular velocity vector, and produces as output the angular momentum vector, therefore 
it is a rank-2 tensor: 


$$\mathbf{I}(\boldsymbol{\omega}, \_) = \mathbf{L}, \qquad \mathbf{I}(\boldsymbol{\omega}, \boldsymbol{\omega}) = \mathbf{L}\cdot\boldsymbol{\omega} = 2\,KE.$$


[Since I is constant in the blob frame, it rotates in the lab frame. Thus, in the lab frame, the above equations
are valid only at a single instant in time. In effect, I is a function of time, I(t).]


[?? This may be a bad example, since I is only a Cartesian tensor [L&L3, p ??], which is not a real tensor.
Real tensors can’t have finite displacements on a curved manifold, but blobs of matter have finite size. If you want
to get the kinetic energy, you have to use the metric to compute L·ω. Is there a simple example of a real rank-2
tensor??]


Note that some rank-2 tensors operate on two vectors to produce a scalar, and some (like I) can either 
act on one vector to produce a vector, or act on two vectors to produce a scalar (twice the kinetic energy). 
More of that, and higher rank tensors, later. 
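
Here is a small numerical sketch of both uses of I. We assume the standard rigid-body formula for point masses, $I_{ij} = \sum m\,(r^2\delta_{ij} - r_ir_j)$, which is not derived in this section; the two masses and their positions are invented for illustration.

```python
# Hypothetical rigid body: point masses as (mass, (x, y, z)); values are illustrative.
masses = [(1.0, (1.0, 0.0, 0.0)), (2.0, (0.0, 1.0, 1.0))]

def inertia(masses):
    """I_ij = sum over masses of m (r^2 delta_ij - r_i r_j), the standard point-mass formula."""
    I = [[0.0] * 3 for _ in range(3)]
    for m, r in masses:
        r2 = sum(x * x for x in r)
        for i in range(3):
            for j in range(3):
                I[i][j] += m * ((r2 if i == j else 0.0) - r[i] * r[j])
    return I

I = inertia(masses)
omega = [0.0, 0.0, 1.0]                                              # spin about the z-axis

L = [sum(I[i][j] * omega[j] for j in range(3)) for i in range(3)]   # I(omega, _): a vector
two_KE = sum(L[i] * omega[i] for i in range(3))                      # I(omega, omega): a scalar

print(L)        # [0.0, -2.0, 3.0]: not parallel to omega
print(two_KE)   # 3.0
```

With one slot filled, I returns the angular momentum vector, which here has a y component even though the rotation is purely about z. With both slots filled, it returns twice the kinetic energy.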


Tensor Fields 


A vector is a single mathematical object, but it is quite common to define a field of vectors. A field in 
this sense is a function of space. A vector field defines a vector for each point in a space. For example, the 
electric field is a vector-valued function of space: at each point in space, there is an electric field vector. 


Similarly, a tensor is a single mathematical object, but it is quite common to define a field of tensors. At 
each point in space, there is a tensor. The metric tensor field is a tensor-valued function of space: at each 
point, there is a metric tensor. Almost universally, the word “field” is omitted when calling out tensor fields: 
when you say “metric tensor,” everyone is expected to know it is a tensor field. When you say “moment of 
inertia tensor,” everyone is expected to know it is a single tensor (not a field). 


Dot Products and Cross Products as Tensors 


Symmetric tensors are associated with elementary dot products, and anti-symmetric tensors are 
associated with elementary cross-products. 

A dot product is a linear operation on two vectors: A·B = B·A, which produces a scalar. Because the
dot product is a linear function of two vectors, it can be written as a tensor. (Recall that any linear function
of vectors can be written as a tensor.) Since it takes two rank-1 tensors, and produces a rank-0 tensor, the
dot product is a rank-2 tensor. Therefore, we can achieve the same result as a dot product with a rank-2
symmetric tensor that accepts two vectors and produces a scalar; call this tensor g:


g(A, B) = g(B, A). 


‘g’ is called the metric tensor: it produces the dot product (aka scalar product) of two vectors. Quite often,
the metric tensor varies with position (i.e., it is a function of the generalized coordinates of the system); then
it is a metric tensor field. It happens that the dot product is symmetric: A·B = B·A; therefore, g is symmetric.
If we write the components of g as a matrix, the matrix will be symmetric, i.e. it will equal its own transpose.
(Do I need to expand on this??)


On the other hand, a cross product is an anti-symmetric linear operation on two vectors, which produces
another vector: A × B = −B × A. Therefore, we can associate one vector, say B, with a rank-2 anti-symmetric
tensor, that accepts one vector and produces another vector:


$$\mathbf{B}(\_, \mathbf{A}) = -\mathbf{B}(\mathbf{A}, \_).$$


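For example, the Lorentz force law (for unit charge): F = v × B. We can write B as a $(^1_1)$ tensor:

$$\mathbf{F} = \mathbf{v}\times\mathbf{B} = \begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k}\\ v^x & v^y & v^z\\ B_x & B_y & B_z \end{vmatrix} = \mathbf{B}(\_, \mathbf{v}) = B^i{}_j v^j = \begin{pmatrix} 0 & B_z & -B_y\\ -B_z & 0 & B_x\\ B_y & -B_x & 0 \end{pmatrix}\begin{pmatrix} v^x\\ v^y\\ v^z \end{pmatrix} = \begin{pmatrix} B_zv^y - B_yv^z\\ -B_zv^x + B_xv^z\\ B_yv^x - B_xv^y \end{pmatrix}$$


We see again how a rank-2 tensor, B, contributes a vector for each component of v:

$B^i{}_x\,\mathbf{e}_i = -B_z\mathbf{j} + B_y\mathbf{k}$ (the first column of B) is weighted by $v^x$.
$B^i{}_y\,\mathbf{e}_i = B_z\mathbf{i} - B_x\mathbf{k}$ (the 2nd column of B) is weighted by $v^y$.
$B^i{}_z\,\mathbf{e}_i = -B_y\mathbf{i} + B_x\mathbf{j}$ (the 3rd column of B) is weighted by $v^z$.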

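[Figure, drawn for $B_x, B_y, B_z > 0$: the column vectors $B^i{}_x = -B_z\mathbf{j} + B_y\mathbf{k}$ and $B^i{}_y = B_z\mathbf{i} - B_x\mathbf{k}$.]

A rank-2 tensor acting on a vector to produce their cross-product.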
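
The anti-symmetric matrix form of the cross product is easy to verify numerically. In this sketch the vectors are arbitrary and the function names `cross_matrix` and `cross` are ours; the matrix layout is the one written out above for v × B.

```python
def cross_matrix(B):
    """Components B^i_j such that (v x B)^i = B^i_j v^j."""
    Bx, By, Bz = B
    return [[0,   Bz, -By],
            [-Bz, 0,   Bx],
            [By, -Bx,  0]]

def cross(v, B):
    """Elementary cross product v x B."""
    return [v[1] * B[2] - v[2] * B[1],
            v[2] * B[0] - v[0] * B[2],
            v[0] * B[1] - v[1] * B[0]]

v, B = [1, 2, 3], [4, 5, 6]       # arbitrary illustrative vectors
M = cross_matrix(B)
via_tensor = [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

print(via_tensor)                 # [-3, 6, -3]
print(cross(v, B))                # [-3, 6, -3]

# The matrix is anti-symmetric: M[i][j] == -M[j][i].
print(all(M[i][j] == -M[j][i] for i in range(3) for j in range(3)))   # True
```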


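TBS: We can also think of the cross product as a fully anti-symmetric rank-3 tensor, which acts on 2
vectors to produce a vector (their cross product). This is the anti-symmetric symbol $\epsilon_{ijk}$ (not a tensor).


Note that both the dot product and cross-product are linear on both of their operands. For example:

$$(\alpha\mathbf{A} + \gamma\mathbf{C})\cdot\mathbf{B} = \alpha(\mathbf{A}\cdot\mathbf{B}) + \gamma(\mathbf{C}\cdot\mathbf{B})$$
$$\mathbf{A}\cdot(\beta\mathbf{B} + \gamma\mathbf{D}) = \beta(\mathbf{A}\cdot\mathbf{B}) + \gamma(\mathbf{A}\cdot\mathbf{D})$$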


Linearity in all the operands is the essence of a tensor. 

Note also that a “rank” of a tensor contracts with (is summed over) a “rank” of one of its operands to
eliminate both of them: one rank of the B-field tensor contracts with one input vector, leaving one surviving 
rank of the B-field tensor, which is the vector result. Similarly, one rank of the metric tensor, g, contracts 
with the first operand vector; another rank of g contracts with the second operand vector, leaving a rank-0 
(scalar) result. 


The Danger of Matrices 


There are some dangers to thinking of tensors as matrices: (1) it doesn’t work for rank 3 or higher tensors, 
and (2) non-commutation of matrix multiplication is harder to follow than the more-explicit summation 
convention. Nonetheless, the matrix conventions are these: 


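• contravariant components and basis covectors (“up” indexes) → column vector. E.g.,

$$\mathbf{v} = \begin{pmatrix} v^1\\ v^2\\ v^3 \end{pmatrix}, \qquad\text{basis 1-forms: } \begin{pmatrix} \mathbf{e}^1\\ \mathbf{e}^2\\ \mathbf{e}^3 \end{pmatrix}$$

• covariant components and basis contravectors (“down” indexes) → row vector. E.g.,

$$\mathbf{w} = (w_1, w_2, w_3); \qquad\text{basis vectors: } (\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3).$$


Matrix rows and columns are indicated by spacing of the indexes, and are independent of their “upness”
or “downness.” The first matrix index is always the row; the second, the column:

$$T^{rc}, \quad T^r{}_c, \quad T_r{}^c, \quad T_{rc}, \qquad\text{where } r = \text{row index},\ c = \text{column index}.$$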


Reading Tensor Component Equations

Tensor equations can be written as equations with tensors as operators (written in bold):

$$KE = \tfrac{1}{2}\,\mathbf{I}(\boldsymbol{\omega}, \boldsymbol{\omega}).$$

Or, they can be written in component form:

$$KE = \tfrac{1}{2}\,I_{ij}\,\omega^i\omega^j. \qquad (1)$$

We'll be using lots of tensor equations written in component form, so it is important to know how to read
them. Note that some standard notations almost require component form: In GR, the Ricci tensor is $R_{\mu\nu}$, and
the Ricci scalar is R:

$$G_{\mu\nu} = R_{\mu\nu} - \tfrac{1}{2}\,R\,g_{\mu\nu}.$$

In component equations, tensor indexes are written explicitly. There are two kinds of tensor indexes:
dummy (aka summation) indexes, and free indexes. Dummy indexes appear exactly twice in any term. Free
indexes appear only once in each term, and the same free indexes must appear in each term (except for scalar
terms). In the above equation, both μ and ν are free indexes, and there are no dummy indexes. In eq. (1)
above, i and j are both dummy indexes and there are no free indexes.


Dummy indexes appear exactly twice in any term, and are used for implied summation, e.g.:

$$KE = \tfrac{1}{2}\,I_{ij}\,\omega^i\omega^j \equiv \frac{1}{2}\sum_{i=1}^{3}\sum_{j=1}^{3} I_{ij}\,\omega^i\omega^j.$$


Free indexes are a shorthand for writing several equations at once. Each free index takes on all possible 
values for it. Thus, 


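$$C^i = A^i + B^i \quad\Rightarrow\quad C^1 = A^1 + B^1,\quad C^2 = A^2 + B^2,\quad C^3 = A^3 + B^3 \qquad\text{(3 equations)},$$

and

$$G_{\mu\nu} = R_{\mu\nu} - \tfrac{1}{2}\,R\,g_{\mu\nu} \quad\Rightarrow$$

$$\begin{array}{llll}
G_{00} = R_{00} - \tfrac{1}{2}Rg_{00}, & G_{01} = R_{01} - \tfrac{1}{2}Rg_{01}, & G_{02} = R_{02} - \tfrac{1}{2}Rg_{02}, & G_{03} = R_{03} - \tfrac{1}{2}Rg_{03},\\
G_{10} = R_{10} - \tfrac{1}{2}Rg_{10}, & G_{11} = R_{11} - \tfrac{1}{2}Rg_{11}, & G_{12} = R_{12} - \tfrac{1}{2}Rg_{12}, & G_{13} = R_{13} - \tfrac{1}{2}Rg_{13},\\
G_{20} = R_{20} - \tfrac{1}{2}Rg_{20}, & G_{21} = R_{21} - \tfrac{1}{2}Rg_{21}, & G_{22} = R_{22} - \tfrac{1}{2}Rg_{22}, & G_{23} = R_{23} - \tfrac{1}{2}Rg_{23},\\
G_{30} = R_{30} - \tfrac{1}{2}Rg_{30}, & G_{31} = R_{31} - \tfrac{1}{2}Rg_{31}, & G_{32} = R_{32} - \tfrac{1}{2}Rg_{32}, & G_{33} = R_{33} - \tfrac{1}{2}Rg_{33}
\end{array}$$

(16 equations).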


It is common to have both dummy and free indexes in the same equation. Thus the GR statement of
conservation of energy and momentum uses μ as a dummy index, and ν as a free index:

$$\nabla_\mu T^{\mu\nu} = 0 \quad\Rightarrow\quad \sum_{\mu=0}^{3}\nabla_\mu T^{\mu 0} = 0,\quad \sum_{\mu=0}^{3}\nabla_\mu T^{\mu 1} = 0,\quad \sum_{\mu=0}^{3}\nabla_\mu T^{\mu 2} = 0,\quad \sum_{\mu=0}^{3}\nabla_\mu T^{\mu 3} = 0$$

(4 equations). Notice that scalars apply to all values of free indexes, and don’t need indexes of their own.
However, any free indexes must match on all tensor terms. It is nonsense to write something like:


$$A^i = B^j + C^j \qquad\text{(nonsense)}.$$

However, it is reasonable to have

$$A^{ij} = B^iC^j. \qquad\text{E.g., angular momentum: } M^{ij} = r^ip^j - r^jp^i.$$
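
In code, the two kinds of indexes play visibly different roles: a free index becomes a loop that builds one output entry per value, while a dummy index becomes a loop that is summed away. A sketch, with made-up component values for `r`, `p`, `I`, and `omega`:

```python
N = 3
r = [1, 2, 3]                 # hypothetical position components r^i
p = [4, 5, 6]                 # hypothetical momentum components p^i
I = [[4, 0, 0],
     [0, 3, -2],
     [0, -2, 3]]              # hypothetical moment-of-inertia components I_ij
omega = [1, 0, 2]             # hypothetical angular velocity components

# Free indexes i, j: M^ij = r^i p^j - r^j p^i is N*N = 9 equations, one per (i, j).
M = [[r[i] * p[j] - r[j] * p[i] for j in range(N)] for i in range(N)]

# Dummy indexes i, j: KE = (1/2) I_ij omega^i omega^j sums over both, leaving a scalar.
KE = 0.5 * sum(I[i][j] * omega[i] * omega[j] for i in range(N) for j in range(N))

print(M)    # [[0, -3, -6], [3, 0, -3], [6, 3, 0]]
print(KE)   # 8.0
```

Note that M has two free indexes and so needs a nested list to hold its results, while KE has none and is a single number.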


Adding, Subtracting, Differentiating Tensors 


Since tensors are linear operations, you can add or subtract any two tensors that take the same type 
arguments and produce the same type result. Just add the tensor components individually. 


$$\mathbf{S} = \mathbf{T} + \mathbf{U}. \qquad\text{E.g., } S^{ij} = T^{ij} + U^{ij}, \quad i, j = 1, \dots, N.$$


You can also scalar multiply a tensor. Since these properties of tensors are the defining requirements 
for a vector space, all the tensors of given rank and index types compose a vector space, and every tensor is 
an MVE in its space. This implies that: 


A tensor field can be differentiated (or integrated), and in particular, it has a gradient. 


Higher Rank Tensors 


When considering higher rank tensors, it may be helpful to recall that multi-dimensional matrices can
be thought of as lower-dimensional matrices with each element itself a vector or matrix. For example, a
3 x 3 matrix can be thought of as a “column vector” of 3 row-vectors. Matrix multiplication works out the same
whether you consider the 3 x 3 matrix as a 2-D matrix of numbers, or a 1-D column vector of row vectors:

$$\begin{pmatrix} x & y & z \end{pmatrix}\begin{pmatrix} a & b & c\\ d & e & f\\ g & h & i \end{pmatrix} = \begin{pmatrix} ax + dy + gz & bx + ey + hz & cx + fy + iz \end{pmatrix}$$

or

$$\begin{pmatrix} x & y & z \end{pmatrix}\begin{pmatrix} (a, b, c)\\ (d, e, f)\\ (g, h, i) \end{pmatrix} = x(a, b, c) + y(d, e, f) + z(g, h, i) = \begin{pmatrix} ax + dy + gz & bx + ey + hz & cx + fy + iz \end{pmatrix}$$
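
The two views can be compared side by side in a few lines. This is only a sketch: the numbers standing in for a through i and for (x, y, z) are arbitrary.

```python
rows = [(1, 2, 3),      # stands in for (a, b, c)
        (4, 5, 6),      # stands in for (d, e, f)
        (7, 8, 9)]      # stands in for (g, h, i)
xyz = (2, 0, -1)        # stands in for (x, y, z)

# View 1: a 2-D matrix of numbers; entry k of the result sums over the row index.
as_numbers = tuple(sum(xyz[r] * rows[r][k] for r in range(3)) for k in range(3))

# View 2: a 1-D "column vector" whose elements are row vectors; weight them and add.
as_vectors = (0, 0, 0)
for weight, row in zip(xyz, rows):
    as_vectors = tuple(acc + weight * entry for acc, entry in zip(as_vectors, row))

print(as_numbers)   # (-5, -4, -3)
print(as_vectors)   # (-5, -4, -3)
```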
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Using this same idea, we can compare the gradient of a scalar field, which is a (°1) tensor field (a 1- 
form), with the gradient of a rank-2 (say (°2)) tensor field, which is a (°3) tensor field. First, the gradient of a 
scalar field is a (°)) tensor field with 3 components, where each component is a number-valued function: 


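$$\nabla f = \mathbf{D} = \frac{\partial f}{\partial x}\,\omega^1 + \frac{\partial f}{\partial y}\,\omega^2 + \frac{\partial f}{\partial z}\,\omega^3, \qquad \omega^1, \omega^2, \omega^3 \text{ are basis (co)vectors}$$

D can be written as $(D_1, D_2, D_3)$, where $D_1 = \dfrac{\partial f}{\partial x}$, $D_2 = \dfrac{\partial f}{\partial y}$, $D_3 = \dfrac{\partial f}{\partial z}$.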


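The gradient operates on an infinitesimal displacement vector to produce the change in the function when
you move through the given displacement:

$$df = \mathbf{D}(d\mathbf{r}) = \frac{\partial f}{\partial x}\,dx + \frac{\partial f}{\partial y}\,dy + \frac{\partial f}{\partial z}\,dz.$$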


Now let R be a $\binom{0}{2}$ tensor field, and T be its gradient. T is a $\binom{0}{3}$ tensor field, but can be thought of as a
$\binom{0}{1}$ tensor field where each component is itself a $\binom{0}{2}$ tensor.


Figure 16.1 A rank-3 tensor considered as a set of 3 rank-2 tensors: an x-tensor, a y-tensor, and a
z-tensor.


The gradient operates on an infinitesimal displacement vector to produce the change in the $\binom{0}{2}$ tensor
field when you move through the given displacement.


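$$\mathbf{T} = \nabla\mathbf{R} = \frac{\partial \mathbf{R}}{\partial x}\,\omega^1 + \frac{\partial \mathbf{R}}{\partial y}\,\omega^2 + \frac{\partial \mathbf{R}}{\partial z}\,\omega^3$$

$$= \left( \begin{pmatrix} T_{11x} & T_{12x} & T_{13x} \\ T_{21x} & T_{22x} & T_{23x} \\ T_{31x} & T_{32x} & T_{33x} \end{pmatrix},\;
\begin{pmatrix} T_{11y} & T_{12y} & T_{13y} \\ T_{21y} & T_{22y} & T_{23y} \\ T_{31y} & T_{32y} & T_{33y} \end{pmatrix},\;
\begin{pmatrix} T_{11z} & T_{12z} & T_{13z} \\ T_{21z} & T_{22z} & T_{23z} \\ T_{31z} & T_{32z} & T_{33z} \end{pmatrix} \right)$$

$$d\mathbf{R} = \mathbf{T}(\mathbf{v}) = \sum_{k = x, y, z} T_{ijk}\, v^k \quad\Leftrightarrow\quad (dR)_{ij} = T_{ijk}\, v^k$$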
Note that if R had been a $\binom{2}{0}$ (fully contravariant) tensor, then its gradient would be a $\binom{2}{1}$ mixed tensor.
Taking the gradient of any field simply adds a covariant index, which can then be contracted with a
displacement vector to find the change in the tensor field when moving through the given displacement.
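As a numerical illustration of "the gradient adds a covariant index", the sketch below builds $T_{ijk} = \partial R_{ij}/\partial x^k$ by finite differences and contracts it with a small displacement. The field `R` and all function names are hypothetical, chosen only for this example:

```python
def R(p):
    """A hypothetical rank-2 tensor field: 3 x 3 components at the point p = (x, y, z)."""
    x, y, z = p
    return [[x * y, z,     1.0],
            [y * z, x * x, y],
            [z,     x,     x * y * z]]


def gradient(field, p, h=1e-5):
    """T[i][j][k] = d(field_ij)/d(x^k), estimated by central differences."""
    n = len(p)
    T = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for k in range(n):
        ahead, behind = list(p), list(p)
        ahead[k] += h
        behind[k] -= h
        f_ahead, f_behind = field(ahead), field(behind)
        for i in range(n):
            for j in range(n):
                T[i][j][k] = (f_ahead[i][j] - f_behind[i][j]) / (2 * h)
    return T


def contract_with_displacement(T, v):
    """(dR)_ij = T_ijk v^k: sum over the new (covariant) index k."""
    n = len(v)
    return [[sum(T[i][j][k] * v[k] for k in range(n)) for j in range(n)]
            for i in range(n)]


p = [1.0, 2.0, 3.0]
v = [1e-4, -2e-4, 3e-4]
dR = contract_with_displacement(gradient(R, p), v)
moved = R([pi + vi for pi, vi in zip(p, v)])
actual = [[moved[i][j] - R(p)[i][j] for j in range(3)] for i in range(3)]
worst = max(abs(dR[i][j] - actual[i][j]) for i in range(3) for j in range(3))
print(worst)  # small: only terms of second order in v remain
```

The rank-3 object `T` is stored as a 3 x 3 array whose elements are 3-component lists, mirroring the "tensor of tensors" picture in Figure 16.1.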


The contraction considerations of the previous section still apply: a rank of a tensor operator contracts
with a rank of one of its inputs to eliminate both. In other words, each rank of input tensors eliminates one
rank of the tensor operator. The rank of the result is the number of surviving ranks from the tensor operator:

$$\text{rank(tensor)} = \Big(\sum \text{rank(inputs)}\Big) + \text{rank(result)}$$

or

$$\text{rank(result)} = \text{rank(tensor)} - \Big(\sum \text{rank(inputs)}\Big).$$


Tensors of Mathematical Vector Elements: The operation of a tensor on vectors involves multiplying 
components (one from the tensor, and one from each input vector), and then summing. E.g., 


$$\mathbf{T}(\mathbf{a}, \mathbf{b}) = T_{11}a^1b^1 + \ldots + T_{ij}a^ib^j + \ldots\,.$$


Similar to the above example, the $T_{ij}$ components could themselves be vectors of a mathematical vector
space (i.e., could be MVEs), while the $a^i$ and $b^j$ components are scalars of that vector space. In the example
above, we could say that each of the $T_{ijx}$, $T_{ijy}$, and $T_{ijz}$ is a rank-2 tensor (an MVE in the space of rank-2
tensors), and the components of v are scalars in that space (in this case, real numbers).


Tensors In General 


Linearity implies that T can be written as a numeric weight for each combination of components, one 
component from each input MVE. Thus, the “linear operation” performed by T is equivalent to a weighted 
sum of all combinations of components of the input MVEs. (Since T and the a, b, ... are simple objects, not 
functions, there is no concept of derivative or integral operations. Derivatives and integrals are linear 
operations on functions, but not linear functions of MVEs.) 


Given the components of the inputs a, b, ..., and the components of T, we can contract T with (operate 
with T on) the inputs to produce a MVE result. Note that all input MVEs have to have the same basis. Also, 
T may have units, so the output units are arbitrary. Note that in generalized coordinates, different components 
of a tensor may have different units (much like the vector parameters r and θ have different units).


Change of Basis: Transformations 


Since tensors are linear operations on MVEs, we can represent a tensor by components. If we know a 
tensor’s operations on all combinations of basis vectors, we have fully defined the tensor. Consider a rank- 
2 tensor T acting on two vectors, a and b. We expand T, a, and b into components, using the linearity of the 
tensor: 


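$$\begin{aligned}
\mathbf{T}(\mathbf{a}, \mathbf{b}) &= \mathbf{T}(a^1\mathbf{i} + a^2\mathbf{j} + a^3\mathbf{k},\; b^1\mathbf{i} + b^2\mathbf{j} + b^3\mathbf{k}) \\
&= a^1b^1\,\mathbf{T}(\mathbf{i}, \mathbf{i}) + a^2b^1\,\mathbf{T}(\mathbf{j}, \mathbf{i}) + a^3b^1\,\mathbf{T}(\mathbf{k}, \mathbf{i}) \\
&\quad + a^1b^2\,\mathbf{T}(\mathbf{i}, \mathbf{j}) + a^2b^2\,\mathbf{T}(\mathbf{j}, \mathbf{j}) + a^3b^2\,\mathbf{T}(\mathbf{k}, \mathbf{j}) \\
&\quad + a^1b^3\,\mathbf{T}(\mathbf{i}, \mathbf{k}) + a^2b^3\,\mathbf{T}(\mathbf{j}, \mathbf{k}) + a^3b^3\,\mathbf{T}(\mathbf{k}, \mathbf{k})
\end{aligned}$$

Define $T_{ij} \equiv \mathbf{T}(\mathbf{e}_i, \mathbf{e}_j)$, where $\mathbf{e}_1 \equiv \mathbf{i}$, $\mathbf{e}_2 \equiv \mathbf{j}$, $\mathbf{e}_3 \equiv \mathbf{k}$;

$$\text{then} \quad \mathbf{T}(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{3}\sum_{j=1}^{3} a^i b^j\, \mathbf{T}(\mathbf{e}_i, \mathbf{e}_j) = \sum_{i=1}^{3}\sum_{j=1}^{3} T_{ij}\, a^i b^j.$$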
The tensor’s values on all combinations of input basis vectors are the components of the tensor (in the basis
of the input vectors).
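To make this concrete, here is a small sketch that recovers the components of a bilinear function by feeding it basis vectors, then checks that the weighted sum reproduces the function. The particular function `T` is an arbitrary, hypothetical example:

```python
def T(a, b):
    """A hypothetical bilinear function of two 3-vectors (coefficients chosen arbitrarily)."""
    return (2 * a[0] * b[0] - a[0] * b[2] + 3 * a[1] * b[1]
            + 5 * a[2] * b[0] + a[2] * b[2])


def components(tensor, n=3):
    """T_ij = T(e_i, e_j): evaluate the tensor on every pair of basis vectors."""
    basis = [[1 if r == c else 0 for c in range(n)] for r in range(n)]
    return [[tensor(basis[i], basis[j]) for j in range(n)] for i in range(n)]


def contract(T_ij, a, b):
    """Weighted sum over all combinations of input components: T_ij a^i b^j."""
    return sum(T_ij[i][j] * a[i] * b[j]
               for i in range(len(a)) for j in range(len(b)))


T_ij = components(T)
a, b = [1, 2, 3], [4, 5, 6]
print(T_ij)                             # [[2, 0, -1], [0, 3, 0], [5, 0, 1]]
print(T(a, b), contract(T_ij, a, b))    # 110 110
```

Once the nine numbers `T_ij` are known, the function `T` itself is no longer needed: the components fully define the tensor in that basis.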


Now let’s transform T to another basis. To change from one basis to another, we need to know how to 
find the new basis vectors from the old ones, or equivalently, how to transform components in the old basis 
to components in the new basis. We write the new basis with primes, and the old basis without primes. 


Because vector spaces demand linearity, any change of basis can be written as a linear transformation of 
the basis vectors or components, so we can write (eq. #s from Talman): 


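$$\mathbf{e}'_i = \sum_{k=1}^{N} A^k{}_i\, \mathbf{e}_k = A^k{}_i\, \mathbf{e}_k \qquad \text{[Tal 2.4.5]}$$

$$v'^i = \sum_{k=1}^{N} (A^{-1})^i{}_k\, v^k = (A^{-1})^i{}_k\, v^k \qquad \text{[Tal 2.4.8]}$$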


where the last form uses the summation convention. There is a very important difference between
equations 2.4.5 and 2.4.8. The first is a set of 3 vector equations, expressing each of the new basis vectors in
the old basis.


Aside: Let’s look more closely at the difference between equations 2.4.5 and 2.4.8. The first is a set of 3 vector 
equations, expressing each of the new basis vectors in the old basis. Basis vectors are vectors, and hence can 
themselves be expressed in any basis: 


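$$\begin{aligned}
\mathbf{e}'_1 &= A^1{}_1\mathbf{e}_1 + A^2{}_1\mathbf{e}_2 + A^3{}_1\mathbf{e}_3 \\
\mathbf{e}'_2 &= A^1{}_2\mathbf{e}_1 + A^2{}_2\mathbf{e}_2 + A^3{}_2\mathbf{e}_3 \\
\mathbf{e}'_3 &= A^1{}_3\mathbf{e}_1 + A^2{}_3\mathbf{e}_2 + A^3{}_3\mathbf{e}_3
\end{aligned}
\qquad \text{or more simply} \qquad
\begin{aligned}
\mathbf{e}'_1 &= a^1\mathbf{e}_1 + a^2\mathbf{e}_2 + a^3\mathbf{e}_3 \\
\mathbf{e}'_2 &= b^1\mathbf{e}_1 + b^2\mathbf{e}_2 + b^3\mathbf{e}_3 \\
\mathbf{e}'_3 &= c^1\mathbf{e}_1 + c^2\mathbf{e}_2 + c^3\mathbf{e}_3
\end{aligned}$$

where the a’s are the components of $\mathbf{e}'_1$ in the old basis, the b’s are the components of $\mathbf{e}'_2$ in the old basis, and the
c’s are the components of $\mathbf{e}'_3$ in the old basis.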


In contrast, equation 2.4.8 is a set of 3 number equations, relating the components of a single vector, taking its 
old components into the new basis. In other words, in the first equation, we are taking new basis vectors and
expressing them in the old basis (new → old). In the second equation, we are taking old components and converting
them to the new basis (old → new). The two equations go in opposite directions: the first takes new to old, the
second takes old to new. So it is natural that the two equations use inverse matrices to achieve those conversions. 
However, because of the inverse matrices in these equations, vector components are said to transform “contrary” 
(oppositely) to basis vectors, so they are called contravariant vectors. 


I think it is misleading to say that contravariant vectors transform “oppositely” to basis vectors. In fact, that is
impossible. Basis vectors are contravectors, and transform like any other contravector. A vector of (1, 0, 0) (in
some basis) is a basis vector. It may also happen to be the value of some physical vector. In both cases, the
expression of the vector (1, 0, 0) (old basis) in the new basis is the same.


Now we can use 2.4.5 to evaluate the components of T in the primed basis: 


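$$T'_{ij} = \mathbf{T}(\mathbf{e}'_i, \mathbf{e}'_j) = \mathbf{T}(A^k{}_i\,\mathbf{e}_k,\; A^l{}_j\,\mathbf{e}_l) = \sum_{k=1}^{N}\sum_{l=1}^{N} A^k{}_i A^l{}_j\, \mathbf{T}(\mathbf{e}_k, \mathbf{e}_l) = \sum_{k=1}^{N}\sum_{l=1}^{N} A^k{}_i A^l{}_j\, T_{kl}.$$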


Notice that there is one use of the transformation matrix A for each index of T to be transformed. 
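The following sketch checks this rule numerically in 2 dimensions, using exact fractions. It assumes the index convention `A[k][i]` for $A^k{}_i$ (row = upper index, column = lower index) and transforms the input components with the inverse matrix as in equation 2.4.8; the sample numbers and function names are illustrative only:

```python
from fractions import Fraction as Fr


def inverse_2x2(A):
    """Exact inverse of a 2 x 2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]


def transform_rank2_covariant(T, A):
    """T'_ij = A^k_i A^l_j T_kl: one use of A for each index of T."""
    n = len(A)
    return [[sum(A[k][i] * A[l][j] * T[k][l] for k in range(n) for l in range(n))
             for j in range(n)] for i in range(n)]


def transform_components(v, A_inv):
    """v'^i = (A^-1)^i_k v^k: vector components use the inverse matrix."""
    n = len(v)
    return [sum(A_inv[i][k] * v[k] for k in range(n)) for i in range(n)]


def evaluate(T, a, b):
    """T(a, b) = T_ij a^i b^j."""
    return sum(T[i][j] * a[i] * b[j] for i in range(len(a)) for j in range(len(b)))


A = [[Fr(1), Fr(1)],
     [Fr(0), Fr(2)]]          # new basis: e'_1 = e_1, e'_2 = e_1 + 2 e_2
T = [[1, 2], [3, 4]]
a, b = [1, 2], [3, -1]

T_new = transform_rank2_covariant(T, A)
a_new = transform_components(a, inverse_2x2(A))
b_new = transform_components(b, inverse_2x2(A))
print(evaluate(T, a, b), evaluate(T_new, a_new, b_new))  # 11 11
```

The scalar result is the same in both bases, even though every individual component of T, a, and b has changed.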


Matrix View of Basis Transformation 


The concept of tensors seems clumsy at first, but it’s a very fundamental concept. Once you get used to 
it, tensors are essentially simple things (though it took me 3 years to understand how “simple” they are). The 
rules for transformations are pretty direct. Transforming a rank-n tensor requires using the transformation 
matrix n times. A vector is rank-1, and transforms by a simple matrix multiply, or in tensor terms, by a 
summation over indices. Here, since we must distinguish row basis from column basis, we put the primes 
on the indices, to indicate which index is in the new basis, and which is in the old basis. 
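$$\mathbf{a}' = A\mathbf{a} \quad\Leftrightarrow\quad
\begin{pmatrix} a^{0'} \\ a^{1'} \\ a^{2'} \\ a^{3'} \end{pmatrix} =
\begin{pmatrix}
A^{0'}{}_0 & A^{0'}{}_1 & A^{0'}{}_2 & A^{0'}{}_3 \\
A^{1'}{}_0 & A^{1'}{}_1 & A^{1'}{}_2 & A^{1'}{}_3 \\
A^{2'}{}_0 & A^{2'}{}_1 & A^{2'}{}_2 & A^{2'}{}_3 \\
A^{3'}{}_0 & A^{3'}{}_1 & A^{3'}{}_2 & A^{3'}{}_3
\end{pmatrix}
\begin{pmatrix} a^0 \\ a^1 \\ a^2 \\ a^3 \end{pmatrix}
\quad\Leftrightarrow\quad a^{\mu'} = A^{\mu'}{}_{\nu}\, a^{\nu}.$$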


Notice that when you sum over (contract over) two indices, they disappear, and you’re left with the 
unsummed index. So above when we sum over old-basis indices, we’re left with a new-basis vector. 


Rank-2 example: The electromagnetic field tensor F is rank-2, and transforms using the transformation
matrix twice, by two summations over indices, transforming both of F’s indices. This is clumsy to
write in matrix terms, because you have to use the transpose of the transformation matrix to transform the
rows; this transposition has no physical significance. In the rank-2 (or higher) case, the tensor notation is
both simpler, and more physically meaningful:


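$$\mathbf{F}' = A\,\mathbf{F}\,A^T \quad\Leftrightarrow$$

$$\begin{pmatrix}
F^{0'0'} & F^{0'1'} & F^{0'2'} & F^{0'3'} \\
F^{1'0'} & F^{1'1'} & F^{1'2'} & F^{1'3'} \\
F^{2'0'} & F^{2'1'} & F^{2'2'} & F^{2'3'} \\
F^{3'0'} & F^{3'1'} & F^{3'2'} & F^{3'3'}
\end{pmatrix} =
\begin{pmatrix}
A^{0'}{}_0 & A^{0'}{}_1 & A^{0'}{}_2 & A^{0'}{}_3 \\
A^{1'}{}_0 & A^{1'}{}_1 & A^{1'}{}_2 & A^{1'}{}_3 \\
A^{2'}{}_0 & A^{2'}{}_1 & A^{2'}{}_2 & A^{2'}{}_3 \\
A^{3'}{}_0 & A^{3'}{}_1 & A^{3'}{}_2 & A^{3'}{}_3
\end{pmatrix}
\begin{pmatrix}
F^{00} & F^{01} & F^{02} & F^{03} \\
F^{10} & F^{11} & F^{12} & F^{13} \\
F^{20} & F^{21} & F^{22} & F^{23} \\
F^{30} & F^{31} & F^{32} & F^{33}
\end{pmatrix}
\begin{pmatrix}
A^{0'}{}_0 & A^{1'}{}_0 & A^{2'}{}_0 & A^{3'}{}_0 \\
A^{0'}{}_1 & A^{1'}{}_1 & A^{2'}{}_1 & A^{3'}{}_1 \\
A^{0'}{}_2 & A^{1'}{}_2 & A^{2'}{}_2 & A^{3'}{}_2 \\
A^{0'}{}_3 & A^{1'}{}_3 & A^{2'}{}_3 & A^{3'}{}_3
\end{pmatrix}$$

$$\Leftrightarrow \quad F^{\mu'\nu'} = A^{\mu'}{}_{\alpha}\, A^{\nu'}{}_{\beta}\, F^{\alpha\beta}.$$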


In general, you have to transform every index of a tensor, each index requiring one use of the 
transformation matrix. 
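A short sketch can confirm that the matrix form (with its transpose) and the index form (two summations) give the same result. The 2 x 2 size, the sample numbers, and the helper names are assumptions made for brevity:

```python
def matmul(X, Y):
    """Ordinary matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(column) for column in zip(*X)]


def transform_matrix_form(F, A):
    """F' = A F A^T."""
    return matmul(matmul(A, F), transpose(A))


def transform_index_form(F, A):
    """F'^{mu nu} = A^mu_alpha A^nu_beta F^{alpha beta}: one A per index."""
    n = len(A)
    return [[sum(A[mu][al] * A[nu][be] * F[al][be]
                 for al in range(n) for be in range(n))
             for nu in range(n)] for mu in range(n)]


A = [[1, 2],
     [3, 4]]
F = [[0, 1],
     [-1, 0]]                 # antisymmetric, like the electromagnetic field tensor

print(transform_matrix_form(F, A))  # [[0, -2], [2, 0]]
print(transform_index_form(F, A))   # [[0, -2], [2, 0]]
```

In the index form, both uses of `A` look alike; the transpose in the matrix form is only bookkeeping for rows versus columns.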


Geometric (Coordinate-Free) Dot and Cross Products 


Coordinate-free methods use definitions in physical or geometric terms, without reference to any 
particular coordinates. For example, in physics, the product of the parallel components of two vectors has 
direct physical meaning and application. Similarly, the familiar vector cross product also has direct physical 
meaning. We here introduce some rudiments of coordinate-free methods, and how they lead to the familiar
formulas for dot and cross products. From the geometric definitions of cross and dot products, we show this 
crucial property of both: 


The dot product and cross product are linear. 


Dot Product: Parallel Components 
We define the product of parallel components of two vectors as the dot product (Figure 16.2a): 
dot product The dot product of two vectors is the product of their parallel components, which is a scalar. 


All properties of the dot product follow from this geometric definition, including the well-known formula for 
ortho-normal rectangular coordinates: 


$$\mathbf{a}\cdot\mathbf{c} = a_x c_x + a_y c_y + a_z c_z.$$
This formula is not a definition; it is a consequence of the coordinate-free definition. 


Figure 16.2b shows graphically that the dot product is distributive in two dimensions (2D). Because this 
proof is coordinate-free, it is true even in oblique (non-orthogonal) non-normal coordinates. The full proof 
of bilinearity includes showing that the dot product is also commutative, and commutes with scalar 
multiplication: 


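$$\mathbf{a}\cdot\mathbf{c} = \mathbf{c}\cdot\mathbf{a}, \qquad (k\mathbf{a})\cdot\mathbf{c} = k\,(\mathbf{a}\cdot\mathbf{c}).$$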
We leave that as an exercise. (We examine cross products below, but there can be no cross-product in only
two dimensions.)


Figure 16.2 (a) Dot product is the product of parallel components: $\mathbf{a}\cdot\mathbf{c} = a_\parallel c$. (b) Coordinate-free proof that
the dot product is distributive in 2D: $(\mathbf{a} + \mathbf{b})\cdot\mathbf{c} = (a_\parallel + b_\parallel)\,c$. (c) Proof that the dot product is distributive in 3D.


It’s a little harder to visualize, but it is crucially important that we can show linearity in 3D as well, using 
coordinate free methods. This linearity is what justifies the (coordinate dependent) algebraic formulas for 
dot product (and cross product); we cannot start with those formulas, and use them to prove linearity. 


For the dot product in 3D, consider $(\mathbf{a} + \mathbf{b})\cdot\mathbf{c}$, shown in Figure 16.2c. We can always choose a y-axis
such that the c-y plane contains both a and c. b, however, points partly upward (say) above the c-y plane.
To construct the component of a (or b) parallel to c, construct a plane perpendicular to c, and containing the
tip of a (or b). Thus the sum of the parallel components equals the parallel component of the sum. Therefore,
the dot product is linear (in its first factor). Since dot product is commutative, it must be linear in its second
factor, as well. Therefore, the dot product is bilinear.


Cross Product: Perpendicular Components 


Similarly, though a little more complicated, we can define the cross product in coordinate free terms 
(Figure 16.3a): 


cross product The cross product of two vectors, $\mathbf{c}\times\mathbf{a}$, is a vector whose magnitude is the product of c’s
and a’s perpendicular components, and whose direction is perpendicular to both c and a,
and oriented with a right-hand screw sense from c to a.


Figure 16.3 (a) Construction of a cross product. (b) The cross product of a sum: The projection
of b slides its tip to the y-z plane.


For the cross-product, first consider the geometric construction of a simple product $\mathbf{c}\times\mathbf{a}$ (Figure 16.3a).
We project a into the plane perpendicular to c and containing c’s tail (the y-z plane). This yields $\mathbf{a}_\perp$, which
is a vector. We then rotate $\mathbf{a}_\perp$ a quarter turn (90 degrees) around c, in a direction given by the right-hand
rule. This vector points in the direction of $\mathbf{c}\times\mathbf{a}$. Multiply its length by the magnitude of c to get $\mathbf{c}\times\mathbf{a}$.


The right-hand rule implies that the cross product is anti-commutative: $\mathbf{c}\times\mathbf{a} = -\mathbf{a}\times\mathbf{c}$.


Now repeat this process for the product $\mathbf{c}\times(\mathbf{a} + \mathbf{b})$ (Figure 16.3b). We start by projecting a and b onto
the y-z plane. As shown, the (vector) sum of the projections equals the projection of the sum. Now we must
rotate the projections about c by 90 degrees. Rotation is a linear operator, so the sum of the rotations equals
the rotation of the sum. Hence, the cross product is linear (in the second factor). Since cross product is
anti-commutative, it must be linear in the first factor, as well. Cross product is bilinear.
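The geometric recipe (project, rotate a quarter turn, scale) can be written out directly. The sketch below assumes, as in the figure, that c lies along the positive x-axis, so that the perpendicular plane is the y-z plane; the function names are illustrative:

```python
def cross_geometric(c_length, a):
    """c x a for c = (c_length, 0, 0) with c_length > 0, built from the geometric steps."""
    projected = (0.0, a[1], a[2])                   # project a into the y-z plane
    rotated = (0.0, -projected[2], projected[1])    # quarter turn about +x, right-hand rule
    return tuple(c_length * r for r in rotated)     # scale by the magnitude of c


def cross_components(c, a):
    """The familiar component formula, for comparison."""
    return (c[1] * a[2] - c[2] * a[1],
            c[2] * a[0] - c[0] * a[2],
            c[0] * a[1] - c[1] * a[0])


a = (1, 2, 3)
b = (-4, 0, 5)
a_plus_b = tuple(x + y for x, y in zip(a, b))

lhs = cross_geometric(2, a_plus_b)
rhs = tuple(x + y for x, y in zip(cross_geometric(2, a), cross_geometric(2, b)))
print(lhs == rhs)                                            # True: linear in the second factor
print(cross_geometric(2, a) == cross_components((2, 0, 0), a))  # True
```

Both steps of the construction (projection and rotation) are linear, which is why the linearity check passes for any a and b.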


Non-Orthonormal Systems: Contravariance and Covariance 


Many systems cannot be represented with orthonormal coordinates, e.g. the (surface of a) sphere. 
Dealing with non-orthonormality requires a more sophisticated view of tensors, and introduces the concepts 
of contravariance and covariance. We start with an introduction to coordinate-free methods. 


Dot Products in Oblique Coordinates 


Oblique coordinates (non-orthogonal axes) appear in many areas of physics and engineering, such as 
generalized coordinates in classical mechanics, and in the differential geometry of relativity. Understanding 
how to compute dot products in oblique coordinates is the foundation for many physically meaningful 
computations, and for the mathematics of contravariant and covariant components of a vector. The “usual” 
components of a vector are the ones called the “contravariant” components. 


We here give several views of dot products and metric tensors. Then, we define the “covariant” 
components of a vector, and show why they are useful (and unavoidable). Finally, we show that a gradient 
is “naturally” covariant. To illustrate, we use a two-dimensional manifold, which is the archetype of all 
higher-dimensional generalizations. This section uses Einstein summation, and makes reference to tensors, 
but you need not understand tensors to follow it. In fact, this is a step on the road to understanding tensors. 


Figure 16.4 (a) Two vectors in oblique coordinates. (b) Geometric meaning of contravariant and
covariant components of a vector. (c) Example of contravariant and covariant components in a
different basis.


In oblique coordinates, we still write a vector as a sum of components (Figure 16.4a): 


$$\mathbf{a} = a^x\mathbf{e}_x + a^y\mathbf{e}_y, \qquad \text{where } \mathbf{e}_x, \mathbf{e}_y \equiv \text{basis vectors}.$$


Note that when a distinction is made between contravariant and covariant components, the “usual” ones are
called contravariant, and are written with a superscript. That is, we construct the vector a by walking $a^x$ units
in the x-direction, and then $a^y$ units in the y-direction, even though x and y are not perpendicular.


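The dot product is defined geometrically, without reference to coordinates, as the product of the parallel
components of two vectors. We showed earlier, also on purely geometric grounds without reference to any
coordinates, that the dot product is bilinear (the distributive property holds for both vectors). Therefore, we
can say:

$$\mathbf{a}\cdot\mathbf{b} = (a^x\mathbf{e}_x + a^y\mathbf{e}_y)\cdot(b^x\mathbf{e}_x + b^y\mathbf{e}_y) = a^xb^x\,\mathbf{e}_x\cdot\mathbf{e}_x + a^xb^y\,\mathbf{e}_x\cdot\mathbf{e}_y + a^yb^x\,\mathbf{e}_y\cdot\mathbf{e}_x + a^yb^y\,\mathbf{e}_y\cdot\mathbf{e}_y. \qquad (16.1)$$

In Figure 16.4, the angle between the x and y axes is θ. If $\mathbf{e}_x$ and $\mathbf{e}_y$ are unit vectors, then we have:

$$\mathbf{a}\cdot\mathbf{b} = a^xb^x + a^xb^y\cos\theta + a^yb^x\cos\theta + a^yb^y. \qquad (16.2)$$

In orthonormal coordinates, this reduces to the familiar formula for dot product:

$$\mathbf{e}_x\cdot\mathbf{e}_y = \cos\theta = 0 \quad\Rightarrow\quad \mathbf{a}\cdot\mathbf{b} = a^xb^x + a^yb^y.$$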


In general, the basis vectors need not be unit magnitude. 
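For brevity, it is standard to collect all the dot products of the basis vectors in (16.1) into a matrix, $g_{\mu\nu}$.
This makes it easier to write our dot product:

$$g_{\mu\nu} \equiv \begin{pmatrix} \mathbf{e}_x\cdot\mathbf{e}_x & \mathbf{e}_x\cdot\mathbf{e}_y \\ \mathbf{e}_y\cdot\mathbf{e}_x & \mathbf{e}_y\cdot\mathbf{e}_y \end{pmatrix} \quad\Rightarrow\quad \mathbf{a}\cdot\mathbf{b} = g_{\mu\nu}\,a^\mu b^\nu \quad \text{(Einstein summation)}.$$

In our example of unit vectors $\mathbf{e}_x$ and $\mathbf{e}_y$, and axis angle θ:

$$g_{\mu\nu} = \begin{pmatrix} 1 & \cos\theta \\ \cos\theta & 1 \end{pmatrix}.$$

For orthonormal coordinates, $\theta = \pi/2$, and $g_{\mu\nu} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, yielding $\mathbf{a}\cdot\mathbf{b} = a^xb^x + a^yb^y$, as usual.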
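As a check on (16.2), the sketch below computes a dot product two ways: with the metric acting on oblique components, and by first converting both vectors to ordinary Cartesian components. The 60 degree axis angle and the sample components are arbitrary assumptions:

```python
import math


def metric(theta):
    """g for unit basis vectors e_x, e_y separated by the angle theta."""
    c = math.cos(theta)
    return [[1.0, c],
            [c, 1.0]]


def dot_with_metric(g, a, b):
    """a . b = g_mu_nu a^mu b^nu, using oblique (contravariant) components."""
    return sum(g[m][n] * a[m] * b[n] for m in range(2) for n in range(2))


def to_cartesian(v, theta):
    """Express v = v^x e_x + v^y e_y with e_x = (1, 0), e_y = (cos theta, sin theta)."""
    return (v[0] + v[1] * math.cos(theta), v[1] * math.sin(theta))


theta = math.pi / 3           # 60 degrees between the axes
a, b = (2.0, 1.0), (1.0, 3.0)

ax, ay = to_cartesian(a, theta)
bx, by = to_cartesian(b, theta)
print(dot_with_metric(metric(theta), a, b))   # approximately 8.5
print(ax * bx + ay * by)                      # approximately 8.5
```

The naive sum $a^xb^x + a^yb^y = 5$ would be wrong here; the off-diagonal metric terms supply the missing 3.5.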


Because the dot product is commutative, $g_{\mu\nu}$ is always symmetric. It can be readily shown that $g_{\mu\nu}$ is a
tensor; it is called the metric tensor. Usually, the dot products of basis vectors that compose $g_{\mu\nu}$ are functions
of the coordinates x and y (or, say, r and θ). This means there is a different metric tensor at each point of the
manifold:

$$g_{\mu\nu} = g_{\mu\nu}(x, y).$$

Any function of a manifold is called a field, so $g_{\mu\nu}(x, y)$ is the metric tensor field. Often when people refer
to the “metric tensor,” they mean the metric tensor field. The metric tensor field is a crucial property of any
curved manifold.


Covariant Components of a Vector 


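Consider the dot product $\mathbf{a}\cdot\mathbf{b}$. It is often helpful to consider separately the contributions to the dot
product from $a^x$ and $a^y$. From the linearity of dot products, we can write:

$$\mathbf{a}\cdot\mathbf{b} = (a^x\mathbf{e}_x + a^y\mathbf{e}_y)\cdot\mathbf{b} = a^x(\mathbf{e}_x\cdot\mathbf{b}) + a^y(\mathbf{e}_y\cdot\mathbf{b}).$$

As shown in Figure 16.4b, the quantities in parentheses are just the component of b parallel to the x-axis and
the y-axis. We define these quantities as the covariant components of b, written with subscripts:

$$b_x \equiv \mathbf{e}_x\cdot\mathbf{b}, \qquad b_y \equiv \mathbf{e}_y\cdot\mathbf{b} \quad\Rightarrow\quad \mathbf{a}\cdot\mathbf{b} = a^\mu b_\mu. \qquad (16.3)$$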


We have not changed the vector b; we have simply projected it onto the axes in a different way. In
comparison: the “usual” contravariant component $b^x$ is the projection onto the x-axis taken parallel to all
other axes (in this case the y axis). The covariant component $b_x$ is the component of b parallel to the x-axis.
Note: 


Raising and Lowering Indexes: Is there an algebraic way to find $b_x$ from $b^\mu$? Of course there is. We
can evaluate the dot products in the definitions of (16.3) with the metric tensor:

$$b_x = \mathbf{e}_x\cdot\mathbf{b} = g_{\mu\nu}\,(\mathbf{e}_x)^\mu\, b^\nu = g_{\mu\nu}\,(1, 0)^\mu\, b^\nu = g_{x\nu}\,b^\nu, \qquad \text{and similarly,} \quad b_y = g_{y\nu}\,b^\nu.$$

We can write both $b_x$ and $b_y$ in a single formula by using a free index, say μ:

$$b_\mu = g_{\mu\nu}\,b^\nu, \qquad \mu = x, y.$$
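We could have derived this directly from the metric form of a dot product, though it wouldn’t have
illuminated the geometric meaning of “covariant”:

$$\mathbf{a}\cdot\mathbf{b} = g_{\mu\nu}\,a^\mu b^\nu = a^\mu\,(g_{\mu\nu}\,b^\nu) = a^\mu b_\mu.$$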
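The two routes to the covariant components (geometric projection onto each basis vector, and lowering with the metric) can be compared numerically. This is a sketch under the same assumptions as the example axes above: unit basis vectors 60 degrees apart, with illustrative names and numbers:

```python
import math


def dot(u, w):
    return sum(ui * wi for ui, wi in zip(u, w))


def basis_vectors(theta):
    """Cartesian expressions of the oblique unit basis vectors e_x, e_y."""
    return ((1.0, 0.0), (math.cos(theta), math.sin(theta)))


def to_cartesian(contra, basis):
    """b = b^x e_x + b^y e_y."""
    return tuple(sum(contra[m] * basis[m][i] for m in range(2)) for i in range(2))


def covariant_by_projection(contra, basis):
    """b_mu = e_mu . b: component of b parallel to each (unit) basis vector."""
    b = to_cartesian(contra, basis)
    return [dot(e, b) for e in basis]


def covariant_by_metric(contra, basis):
    """b_mu = g_mu_nu b^nu, with g_mu_nu = e_mu . e_nu."""
    g = [[dot(em, en) for en in basis] for em in basis]
    return [sum(g[m][n] * contra[n] for n in range(2)) for m in range(2)]


basis = basis_vectors(math.pi / 3)
a_up, b_up = (2.0, 1.0), (1.0, 3.0)

b_down = covariant_by_metric(b_up, basis)
print(covariant_by_projection(b_up, basis))            # approximately [2.5, 3.5]
print(b_down)                                          # approximately [2.5, 3.5]
print(sum(a * b for a, b in zip(a_up, b_down)))        # a^mu b_mu, approximately 8.5
```

Note that the final contraction $a^\mu b_\mu$ needs no metric: the metric was already "used up" in lowering the index of b.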
What is a gradient? The familiar form of a gradient holds, even in curved manifolds:

$$\nabla f(x, y) = \frac{\partial f}{\partial x}\,(\mathbf{e}_x?) + \frac{\partial f}{\partial y}\,(\mathbf{e}_y?).$$

The components of the vector gradient are $\partial f/\partial x$ and $\partial f/\partial y$. But are $\partial f/\partial x$ and $\partial f/\partial y$ the contravariant or
covariant components of the gradient vector? We answer that by considering how we use the gradient in a
formula. Consider how the function f changes in an infinitesimal step from (x, y) to (x + dx, y + dy):

$$df = (\nabla f)\cdot(dx\,\mathbf{e}_x + dy\,\mathbf{e}_y) = \frac{\partial f}{\partial x}\,dx + \frac{\partial f}{\partial y}\,dy.$$

We did not need to use the metric to evaluate this dot product. Since the displacement vector (dx, dy) is
contravariant, it must be that the gradient $(\partial f/\partial x, \partial f/\partial y)$ is covariant. Therefore, we write it with a subscript:

$$\nabla f \equiv \partial_\mu f, \qquad b^\mu \equiv (dx, dy) \quad\Rightarrow\quad df = (\partial_\mu f)\,b^\mu \quad \text{(Einstein summation)}.$$


In general, derivative operators (gradient, covariant derivative, exterior derivative, Lie derivative) produce
covariant vector components (or a covariant index on a higher rank tensor). If our function f has units of
“things”, and our displacement $b^\mu$ is in meters, then our covariant gradient $\partial_\mu f$ is in “things per meter.” Thus,
covariant components can sometimes be thought of as “rates.”


As a physical example, canonical momentum, a derivative of the Lagrangian, is a covariant vector:

$$p_i = \frac{\partial L}{\partial \dot{q}^i}, \qquad \text{where } q^i \text{ are the (contravariant) generalized coordinates}.$$


This allows us to calculate the classical action of mechanics, a physical invariant that is independent of
coordinates, without using a metric:

$$I = \int_{q^i,\ \text{initial}}^{q^i,\ \text{final}} p_i(q^i)\; dq^i \qquad \text{(Einstein summation)}.$$


This is crucial because phase space has no metric! Therefore, there is no such thing as a contravariant
momentum $p^i$. Furthermore, viewing the canonical momenta as covariant components helps clarify the
meaning of Noether’s theorem (See Funky Mechanics Concepts).


Elsewhere, we describe an alternative geometric description of covariant vector components: the 1-form. 


Example: Classical Mechanics with Oblique Generalized Coordinates 


Consider the following problem from classical mechanics: a pendulum is suspended from a pivot point 
which slides horizontally on a spring. The generalized coordinates are (a, θ).


Figure 16.5 (a) Classical mechanical system. (b) Differential area of configuration space, bounded by lines of
constant a and constant θ, with corners (a, θ), (a+da, θ), (a, θ+dθ), and (a+da, θ+dθ), and displacement
$d\mathbf{r} = da\,\hat{\mathbf{a}} + d\theta\,\hat{\boldsymbol{\theta}}$.


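To compute kinetic energy, we need to compute $|\mathbf{v}|^2$, conveniently done in some orthogonal coordinates, say
x and y. We start by converting the generalized coordinates to the orthonormal x-y coordinates, to compute
the length of a physical displacement from the changes in generalized coordinates:

$$\begin{aligned}
x &= a + l\sin\theta, & dx &= da + l\cos\theta\; d\theta \\
y &= l\cos\theta, & dy &= -l\sin\theta\; d\theta \qquad\Rightarrow
\end{aligned}$$

$$\begin{aligned}
ds^2 = dx^2 + dy^2 &= da^2 + 2l(\cos\theta)\, da\, d\theta + l^2\cos^2\theta\; d\theta^2 + l^2\sin^2\theta\; d\theta^2 \\
&= da^2 + 2l(\cos\theta)\, da\, d\theta + l^2\, d\theta^2.
\end{aligned}$$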


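We have just computed the metric tensor field, which is a function of position in the (a, θ) configuration
space. We can write the metric tensor field components by inspection:

Let $x^1 \equiv a$, $x^2 \equiv \theta$:

$$ds^2 = \sum_{i=1}^{2}\sum_{j=1}^{2} g_{ij}\,dx^i dx^j = da^2 + 2l\cos\theta\; da\, d\theta + l^2\, d\theta^2 \quad\Rightarrow\quad g_{ij} = \begin{pmatrix} 1 & l\cos\theta \\ l\cos\theta & l^2 \end{pmatrix}.$$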


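Then for velocities:

$$\begin{aligned}
|\mathbf{v}|^2 = \dot{x}^2 + \dot{y}^2 &= \left[\dot{a} + l(\cos\theta)\,\dot\theta\right]^2 + \left[-l(\sin\theta)\,\dot\theta\right]^2 \\
&= \dot{a}^2 + 2l(\cos\theta)\,\dot{a}\dot\theta + l^2(\cos^2\theta)\,\dot\theta^2 + l^2(\sin^2\theta)\,\dot\theta^2 \\
&= \dot{a}^2 + 2l(\cos\theta)\,\dot{a}\dot\theta + l^2\dot\theta^2 \\
&= \begin{pmatrix} \dot{a} & \dot\theta \end{pmatrix} \underbrace{\begin{pmatrix} 1 & l\cos\theta \\ l\cos\theta & l^2 \end{pmatrix}}_{g_{ij}} \begin{pmatrix} \dot{a} \\ \dot\theta \end{pmatrix} = g_{ij}\,\dot{x}^i\dot{x}^j.
\end{aligned}$$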


A key point here is that the same metric tensor computes a physical displacement from generalized coordinate
displacements, or a physical velocity from generalized coordinate velocities, or a physical acceleration from
generalized coordinate accelerations, etc., because time is the same for any generalized coordinate system
(no Relativity here!). Note that we symmetrize the cross-terms of the metric, $g_{ij} = g_{ji}$, which is necessary to
ensure that g(v, w) = g(w, v).
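The metric just derived can be tested against a direct Cartesian calculation. The sketch below compares $g_{ij}\dot{x}^i\dot{x}^j$ with a finite-difference estimate of $\dot{x}^2 + \dot{y}^2$; the numerical values of a, θ, l, and the velocities are arbitrary assumptions:

```python
import math


def cartesian(a, theta, l):
    """x = a + l sin(theta), y = l cos(theta), as in the text."""
    return (a + l * math.sin(theta), l * math.cos(theta))


def metric(theta, l):
    """g_ij in the (a, theta) configuration space."""
    return [[1.0, l * math.cos(theta)],
            [l * math.cos(theta), l * l]]


def speed_squared_metric(theta, l, a_dot, theta_dot):
    """|v|^2 = g_ij qdot^i qdot^j, from generalized velocities."""
    g = metric(theta, l)
    q_dot = (a_dot, theta_dot)
    return sum(g[i][j] * q_dot[i] * q_dot[j] for i in range(2) for j in range(2))


def speed_squared_cartesian(a, theta, l, a_dot, theta_dot, dt=1e-6):
    """|v|^2 = xdot^2 + ydot^2, with xdot and ydot from central differences in time."""
    x0, y0 = cartesian(a - a_dot * dt, theta - theta_dot * dt, l)
    x1, y1 = cartesian(a + a_dot * dt, theta + theta_dot * dt, l)
    vx, vy = (x1 - x0) / (2 * dt), (y1 - y0) / (2 * dt)
    return vx * vx + vy * vy


a, theta, l = 0.3, 0.7, 2.0
a_dot, theta_dot = 1.5, -0.4
print(speed_squared_metric(theta, l, a_dot, theta_dot))
print(speed_squared_cartesian(a, theta, l, a_dot, theta_dot))
```

The two printed values agree to within the finite-difference error. Note that the metric depends on θ but not on a, so it is a genuine field over configuration space.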


Now consider the scalar product of two vectors. The same metric tensor (field) helps compute the scalar 
product (dot product) of any two (infinitesimal) vectors, from their generalized coordinates: 


$$d\mathbf{v}\cdot d\mathbf{w} = g(d\mathbf{v}, d\mathbf{w}) = g_{ij}\,dv^i dw^j.$$

Since the metric tensor takes two input vectors, is linear in both, and produces a scalar result, it is a rank-2
tensor. Also, since g(v, w) = g(w, v), g is a symmetric tensor.


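Now, let’s define a scalar field as a function of the generalized coordinates; say, the potential energy:

$$U = \frac{k}{2}a^2 - mgl\cos\theta.$$

It is quite useful to know the gradient of the potential energy:

$$\mathbf{D} = \nabla U = \frac{\partial U}{\partial a}\,\omega^a + \frac{\partial U}{\partial \theta}\,\omega^\theta.$$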
da a0 da a0 
The gradient takes an infinitesimal displacement vector dr = (da, dθ), and produces a differential in the value
of potential energy, dU (a scalar). Further, dU is a linear function of the displacement vector. Hence, by
definition, the gradient at each point in a-θ space is a rank-1 tensor, i.e. the gradient is a tensor field.
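A small numerical sketch shows the gradient acting on a displacement in the oblique (a, θ) coordinates. It assumes a potential of the form $U = \frac{1}{2}ka^2 - mgl\cos\theta$ (spring plus gravity) with made-up parameter values; the constant and function names are hypothetical:

```python
import math

# Assumed, illustrative parameter values (not taken from the text).
K_SPRING, MASS, GRAVITY, LENGTH = 3.0, 1.0, 9.8, 2.0


def potential(a, theta):
    """Assumed potential energy: U = k a^2 / 2 - m g l cos(theta)."""
    return 0.5 * K_SPRING * a * a - MASS * GRAVITY * LENGTH * math.cos(theta)


def gradient(a, theta):
    """Covariant components (D_a, D_theta) = (dU/da, dU/dtheta)."""
    return (K_SPRING * a, MASS * GRAVITY * LENGTH * math.sin(theta))


def change_in_potential(a, theta, da, dtheta):
    """dU = D_a da + D_theta dtheta: a plain sum of products over the shared index."""
    D_a, D_theta = gradient(a, theta)
    return D_a * da + D_theta * dtheta


a, theta = 0.3, 0.7
da, dtheta = 1e-5, -2e-5
predicted = change_in_potential(a, theta, da, dtheta)
actual = potential(a + da, theta + dtheta) - potential(a, theta)
print(predicted, actual)
```

Even though the (a, θ) coordinates are oblique, nothing in `change_in_potential` refers to the pendulum metric.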


Do we need to use the metric (computed earlier) to make the gradient operate on dr? No! The gradient 
operates directly on dr, without the need for any “assistance” by a metric. So the gradient is a rank-1 tensor 
that can directly contract with a vector to produce a scalar. This is markedly different from the dot product 
case above, where the first vector (a rank-1 tensor) could not contract directly with an input vector to produce 
a scalar. So clearly, 


Those tensors that can operate directly on a vector are called covariant tensors, and those that need help 
are called contravariant, for reasons we will show soon. To indicate that D is covariant, we write its 
components with subscripts, instead of superscripts. Its basis vectors are covariant vectors, related to e1, e2, 
and e3: 


$$\mathbf{D} = D_i\,\omega^i = D_a\,\omega^a + D_\theta\,\omega^\theta, \qquad \text{where } \omega^a, \omega^\theta \text{ are covariant basis vectors}.$$


In general, covariant tensors result from differentiation operators on other (scalar or) tensor fields: gradient, 
covariant derivative, exterior derivative, Lie derivative, etc. 


Note that just as we can say that D acts on dr, we can say that dr is a rank-1 tensor that acts on D to 
produce dU: 


$$\mathbf{D}(d\mathbf{r}) = d\mathbf{r}(\mathbf{D}) = \sum_i \frac{\partial U}{\partial x^i}\,dx^i = dU.$$

The contractions are the same with either acting on the other, so the definitions are symmetric. 
Interestingly, when we compute small oscillations of a system of particles, we need both the potential
matrix, which is the gradient of the gradient of the potential field, and the “mass” matrix, which really gives
us kinetic energy (rather than mass). The potential matrix is fully covariant, and we need no metric to
compute it. The kinetic energy matrix requires us to compute absolute magnitudes $|\mathbf{v}|^2$, and so requires us
to compute the metric.


We know that a vector, which is a rank-1 tensor, can be visualized as an arrow. How do we visualize 
this covariant tensor, in a way that reveals how it operates on a vector (an arrow)? We use a set of equally 
spaced parallel planes (Figure 16.6). 


Figure 16.6 Visualization of a covariant vector (1-form) as oriented parallel planes.
The 1-form is a linear operator on vectors (see text): $D(\mathbf{v}_1 + \mathbf{v}_2) = D(\mathbf{v}_1) + D(\mathbf{v}_2)$. In the figure,
$D(\mathbf{v}_1), D(\mathbf{v}_2) > 0$ and $D(\mathbf{v}_3) < 0$.


Let D be a covariant tensor (aka 1-form). The value of D on a vector, D(v), is the number of planes “pierced” 
by the vector when laid on the parallel planes. Clearly, D(v) depends on the magnitude and direction of v. 
It is also a linear function of v: the sum of planes pierced by two different vectors equals the number of planes
pierced by their vector sum, and scales with the vectors: $D(a\mathbf{v} + b\mathbf{w}) = aD(\mathbf{v}) + bD(\mathbf{w})$.


There is an orientation to the planes. One side is negative, and the other positive. Vectors crossing in 
the negative to the positive direction “pierce” a positive number of planes. Vectors crossing in the positive 
to negative direction “pierce” a negative number of planes. 


Note also we could redraw the two axes arbitrarily oblique (non-orthogonal), and rescale the axes 
arbitrarily, but keeping the intercept values of the planes with the axes unchanged (thus stretching the arrows 
and planes). The number of planes pierced would be the same, so the two diagrams above are equivalent. 
Hence, this geometric construction of the operation of a covector on a contravector is completely general, 
and even applies to vector spaces which have no metric (aka “non-metric” spaces). All you need for the 
construction is a set of arbitrary basis vectors (not necessarily orthonormal), and the values $D(\mathbf{e}_i)$ on each,
and you can draw the parallel planes that illustrate the covector. 


The “direction” of D, analogous to the direction of a vector, is normal to (perpendicular to) the planes 
used to graphically represent D. 


What Goes Up Can Go Down: Duality of Contravariant and Covariant 
Vectors 


Recall the dot product is given by:

$$d\mathbf{v}\cdot d\mathbf{w} = g(d\mathbf{v}, d\mathbf{w}) = g_{ij}\,dv^i dw^j.$$

If I fill only one slot of g with v, and leave the 2nd slot empty, then g(v, _ ) is a linear function of one vector,
and can be directly contracted with that vector; hence g(v, _ ) is a rank-1 covariant vector. For any given
contravariant vector $v^i$, I can define this “dual” covariant vector, g(v, _ ), which has N components I’ll call
$v_i$:

$$v_i = g(\mathbf{v}, \_) = g_{ik}\,v^k.$$


The covariant representation can contract directly with a contravariant vector, and the contravariant 
representation can contract directly with a covariant vector, to produce the dot product of the two vectors. 
Therefore, we can use the metric tensor to “lower” the components of a contravariant vector into their 
covariant equivalents. 


Note that the metric tensor itself has been written with two covariant (lower) indexes, because it contracts 
directly with two contravariant vectors to produce their scalar-product. 


Why do I need two forms of the same vector? Consider the vector “force:”

$$\mathbf{F} = m\mathbf{a} \qquad \text{or} \qquad F^i = m a^i \qquad \text{(naturally contravariant)}.$$

Since position $x^i$ is naturally contravariant, so is its derivative $v^i$, and 2nd derivative, $a^i$. Therefore, force is
“naturally” contravariant. But force is also the gradient of potential energy:

$$\mathbf{F} = -\nabla U \qquad \text{or} \qquad F_i = -\frac{\partial U}{\partial x^i} \qquad \text{(naturally covariant)}.$$

Oops! Now “force” is naturally covariant! But physically, it’s the same force as above. So which is more
natural for “force?” Neither. Use whichever one you need. Nurture supersedes nature.


The inverse of the metric tensor matrix is the contravariant metric tensor, $g^{ij}$. It contracts directly with
two covariant vectors to produce their scalar product. Hence, we can use $g^{ij}$ to “raise” the index of a covariant
vector to get its contravariant components:

$$v^i = g(\mathbf{v}, \_) = g^{ik}\,v_k, \qquad g^{ik}g_{kj} = g^i{}_j = \delta^i{}_j.$$

Notice that raising and lowering works on the metric tensor itself. Note that in general, even for symmetric
tensors, $T_i{}^j \neq T_j{}^i$, and $T_i{}^j \neq T^i{}_j$.


For rank-2 or higher tensors, each index is separately of the contravariant or covariant type. Each index 
may be raised or lowered separately from the others. Each lowering requires a contraction with the fully 
covariant metric tensor; each raising requires a contraction with the fully contravariant metric tensor. 
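Lowering and then raising an index should return the original components. The sketch below checks this with exact fractions, assuming the 2D oblique metric for unit basis vectors 60 degrees apart (so the off-diagonal entries are 1/2); the names are illustrative:

```python
from fractions import Fraction as Fr


def inverse_2x2(g):
    """Exact inverse of a 2 x 2 matrix: the contravariant metric g^ij."""
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    return [[g[1][1] / det, -g[0][1] / det],
            [-g[1][0] / det, g[0][0] / det]]


def lower_index(g, v_up):
    """v_i = g_ik v^k."""
    return [sum(g[i][k] * v_up[k] for k in range(2)) for i in range(2)]


def raise_index(g_inv, v_down):
    """v^i = g^ik v_k."""
    return [sum(g_inv[i][k] * v_down[k] for k in range(2)) for i in range(2)]


g = [[Fr(1), Fr(1, 2)],
     [Fr(1, 2), Fr(1)]]       # assumed metric: unit axes at 60 degrees
g_inv = inverse_2x2(g)

v_up = [Fr(2), Fr(1)]
v_down = lower_index(g, v_up)
print(v_down)                                       # components 5/2 and 2
print(raise_index(g_inv, v_down) == v_up)           # True: the round trip is exact
print(sum(u * d for u, d in zip(v_up, v_down)))     # 7, the squared length of v
```

In an orthonormal basis `g` would be the identity, and `v_down` would equal `v_up`, which is why elementary treatments never need the distinction.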


In Euclidean space with orthonormal coordinates, the metric tensor is the identity matrix. Hence, the 
covariant and contravariant components of any vector are identical. This is why there is no distinction made 
in elementary treatments of vector mathematics; displacements, gradients, everything, are simply called 
“vectors.” 


The space of covectors is a vector space, i.e. it satisfies the properties of a vector space. However, it is 
called “dual” to the vector space of contravectors, because covectors operate on contravectors to produce 
scalar invariants. A thing is dual to another thing if the dual can act on the original thing to produce a scalar, 
and vice versa. E.g., in QM, bras are dual to kets. “Vectors in the dual space” are covectors. 


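Just like basis contravectors, basis covectors always have components (in their own basis):

$$\omega^1 = (1, 0, 0, \ldots), \qquad \omega^2 = (0, 1, 0, \ldots), \qquad \omega^3 = (0, 0, 1, \ldots), \quad \text{etc.}$$

and we can write an arbitrary covector as $\mathbf{f} = f_1\omega^1 + f_2\omega^2 + f_3\omega^3 + \ldots$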


TBS: construction and units of a dual covector from its contravector. 


The Real Summation Convention 


The summation convention says repeated indexes in an arithmetic expression are implicitly summed 
(contracted). We now understand that only a contravariant/covariant pair can be meaningfully summed. Two 
covariant or two contravariant indexes require contracting with the metric tensor to be meaningful. Hence, 
the real Einstein summation convention is that any two matching indexes, one “up” (contravariant) and one 
“down” (covariant), are implicitly summed (contracted). Two matching contravariant or covariant indexes 
are meaningless, and not allowed. 


Now we can see why basis contravectors are written $\mathbf{e}_1, \mathbf{e}_2, \ldots$ (with subscripts), and basis covectors are
written $\omega^1, \omega^2, \ldots$ (with superscripts). It is purely a trick to comply with the real summation convention that
requires summations be performed over one “up” index and one “down” index. Then we can write a vector
as a linear combination of the basis vectors, using the summation convention:

$$\mathbf{v} = v^1\mathbf{e}_1 + v^2\mathbf{e}_2 + v^3\mathbf{e}_3 = v^i\mathbf{e}_i, \qquad \tilde{\mathbf{a}} = a_1\omega^1 + a_2\omega^2 + a_3\omega^3 = a_i\,\omega^i.$$

Note well: there is nothing “covariant” about $\mathbf{e}_i$, even though it has a subscript; there is nothing
“contravariant” about $\omega^i$, even though it has a superscript. It’s just a notational trick.


Transformation of Covariant Indexes 


It turns out that the components of a covariant vector transform with the same matrix as used to express 
the new (primed) basis vectors in the old basis: 


$$f'_k = f_j \Lambda^j{}_k \qquad \text{[Tal 2.4.11]}.$$


Again, somewhat bogusly, eq. 2.4.11 is said to “transform covariantly with” (the same as) the basis vectors,
so $f_j$ is called a covariant vector.


For a rank-2 tensor such as $T_{ij}$, each index of $T_{ij}$ transforms “like” the basis vectors (i.e., covariantly
with the basis vectors). Hence, each index of $T_{ij}$ is said to be a “covariant” index. Since both indexes are
covariant, $T_{ij}$ is sometimes called “fully covariant.”
Indefinite Metrics: Relativity


In short, a covariant index of a tensor is one which can be contracted with (summed over) a contravariant 
index of an input MVE to produce a meaningful resultant MVE. 


In relativity, the metric tensor has some negative signs. The scalar-product is a frame-invariant 
“interval.” No problem. All the math, raising, and lowering, works just the same. In special relativity, the 
metric ends up simply putting minus signs where you need them to get SR intervals. The covariant form of 
a vector has the minus signs “pre-loaded,” so it contracts directly with a contravariant vector to produce a 
scalar. 
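
As a small numerical sketch of the "pre-loaded minus sign," assume the sign convention diag(−1, 1, 1, 1) and units with c = 1; the event coordinates below are hypothetical.

```python
import numpy as np

# Assumed convention: Minkowski metric with signature (-, +, +, +), c = 1.
eta = np.diag([-1.0, 1.0, 1.0, 1.0])

def lower_index(metric, contravariant):
    """Return the covariant components x_mu = g_mu_nu x^nu."""
    return metric @ contravariant

# Hypothetical displacement (t, x, y, z) along a light ray: t^2 = x^2 + y^2 + z^2.
x_up = np.array([5.0, 3.0, 0.0, 4.0])
x_down = lower_index(eta, x_up)   # the time component picks up the minus sign

# The covariant form contracts directly with the contravariant form.
interval = x_down @ x_up

print(x_down, interval)
```

The lowered vector is (−5, 3, 0, 4), and the interval comes out zero, as expected for a lightlike displacement.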


Let’s use the sign convention where $\eta_{\mu\nu} = \mathrm{diag}(-1, 1, 1, 1)$. When considering the dual 1-forms for
Minkowski space, the only unusual aspect is that the 1-form for time increases in the opposite direction to
the vector for time. For the space components, the dual 1-forms increase in the same direction as the vectors.
This means that: 


as it should for the Minkowski metric. 


Is a Transformation Matrix a Tensor? 


Sort of. When applied to a vector, it converts components from the “old” basis to the “new” basis. It is 
clearly a linear function of its argument. However, a tensor usually has all its inputs and outputs in the same 
basis (or tensor products of that basis). But a transformation matrix is specifically constructed to take inputs 
in one basis, and produce outputs in a different basis. Essentially, the columns are indexed by the old basis, 
and the rows are indexed by the new basis. It basically works like a tensor, but the transformation rule is that 
to transform the columns, you use a transformation matrix for the old basis; to transform the rows, you use 
the transformation matrix for the new basis. 


Consider a vector:

$$v = v^1 e_1 + v^2 e_2 + v^3 e_3.$$


This is a vector equation, and despite its appearance, it is true in any basis, not just the $(e_1, e_2, e_3)$ basis. If
we write $e_1, e_2, e_3$ as vectors in some new $(e_x, e_y, e_z)$ basis, the vector equation above still holds:


$$e_1 = (e_1)^x e_x + (e_1)^y e_y + (e_1)^z e_z$$
$$e_2 = (e_2)^x e_x + (e_2)^y e_y + (e_2)^z e_z$$
$$e_3 = (e_3)^x e_x + (e_3)^y e_y + (e_3)^z e_z$$


$$v = v^1 e_1 + v^2 e_2 + v^3 e_3$$

$$= v^1\left[(e_1)^x e_x + (e_1)^y e_y + (e_1)^z e_z\right] + v^2\left[(e_2)^x e_x + (e_2)^y e_y + (e_2)^z e_z\right] + v^3\left[(e_3)^x e_x + (e_3)^y e_y + (e_3)^z e_z\right]$$


The vector v is just a weighted sum of basis vectors, and therefore the columns of the transformation
matrix are the old basis vectors expressed in the new basis. E.g., to transform the components of a vector
from the $(e_1, e_2, e_3)$ to the $(e_x, e_y, e_z)$ basis, the transformation matrix is:


$$\begin{pmatrix} (e_1)^x & (e_2)^x & (e_3)^x \\ (e_1)^y & (e_2)^y & (e_3)^y \\ (e_1)^z & (e_2)^z & (e_3)^z \end{pmatrix} = \begin{pmatrix} e_1 \cdot e_x & e_2 \cdot e_x & e_3 \cdot e_x \\ e_1 \cdot e_y & e_2 \cdot e_y & e_3 \cdot e_y \\ e_1 \cdot e_z & e_2 \cdot e_z & e_3 \cdot e_z \end{pmatrix}$$

You can see directly that the first column is $e_1$ written in the x-y-z basis; the 2nd column is $e_2$ in the x-y-z basis;
and the 3rd column is $e_3$ in the x-y-z basis.
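
The column structure can be seen in a short sketch. The old basis vectors below are hypothetical, chosen only for illustration: stacking them as columns gives the matrix that converts old components to new components, and the result matches the weighted sum of basis vectors.

```python
import numpy as np

# Hypothetical old basis vectors e_1, e_2, e_3, written in the new (x, y, z) basis.
e1 = np.array([1.0, 0.0, 0.0])
e2 = np.array([1.0, 1.0, 0.0])
e3 = np.array([0.0, 1.0, 2.0])

# Columns of the transformation matrix are the old basis vectors in the new basis.
transform = np.column_stack([e1, e2, e3])

# Components (v^1, v^2, v^3) of some vector in the old basis.
v_old = np.array([2.0, -1.0, 0.5])

# Matrix form of the transformation ...
v_new = transform @ v_old

# ... and the same thing written as a weighted sum of basis vectors.
v_weighted_sum = v_old[0] * e1 + v_old[1] * e2 + v_old[2] * e3

print(v_new, v_weighted_sum)
```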


How About the Pauli Vector? 


In quantum mechanics, the Pauli vector is a vector of three 2x2 matrices: the Pauli matrices. Each 2x2
complex-valued matrix corresponds to a spin-1/2 operator in some x, y, or z direction. It is a 3rd-rank object
in the tensor product space of $\mathbb{R}^3 \otimes \mathbb{C}^2 \otimes \mathbb{C}^2$, i.e. xyz ⊗ spinor ⊗ spinor. The xyz rank is clearly in a different
basis than the complex spinor ranks, since xyz is a completely different vector space than spin-1/2 spinor
space. However, it is a linear operator on various objects, so each rank transforms according to the
transformation matrix for its basis.


$$\sigma = (\sigma_x, \sigma_y, \sigma_z) = \left( \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \right)$$


It’s interesting to note that the term tensor product produces, in general, an object of mixed bases, and 
often, mixed vector spaces. Nonetheless, the term “tensor” seems to be used most often for mathematical 
objects whose ranks are all in the same basis. 


Cartesian Tensors 


Cartesian tensors aren’t quite tensors, because they don’t transform into non-Cartesian coordinates 
properly. (Note that despite their name, Cartesian tensors are not a special kind of tensor; they aren’t really 
tensors. They’re tensor wanna-be’s.) Cartesian tensors have two failings that prevent them from being true 
tensors: they don’t distinguish between contravariant and covariant components, and they treat finite 
displacements in space as vectors. In non-orthogonal coordinates, you must distinguish contravariant and 
covariant components. In non-Cartesian coordinates, only infinitesimal displacements are vectors. Details: 


Recall that in Cartesian coordinates, there is no distinction between contravariant and covariant 
components of a tensor. This allows a certain sloppiness that one can only get away with if one sticks to 
Cartesian coordinates. This means that Cartesian “tensors” only transform reliably by rotations from one set 
of Cartesian coordinates to a new, rotated set of Cartesian coordinates. Since both the new and old bases are 
Cartesian, there is no need to distinguish contravariant and covariant components in either basis, and the 
transformation (to a rotated coordinate system) “works.” 


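For example, the moment of inertia “tensor” is a Cartesian tensor. There is no problem in its first use,
to compute the angular momentum of a blob of mass given its angular velocity:


$$I(\omega, \_) = L \;\Rightarrow\; L^i = I^i{}_j\,\omega^j \;\Rightarrow\; \begin{pmatrix} L^x \\ L^y \\ L^z \end{pmatrix} = \begin{pmatrix} I^x{}_x & I^x{}_y & I^x{}_z \\ I^y{}_x & I^y{}_y & I^y{}_z \\ I^z{}_x & I^z{}_y & I^z{}_z \end{pmatrix} \begin{pmatrix} \omega^x \\ \omega^y \\ \omega^z \end{pmatrix} = \omega^x \begin{pmatrix} I^x{}_x \\ I^y{}_x \\ I^z{}_x \end{pmatrix} + \omega^y \begin{pmatrix} I^x{}_y \\ I^y{}_y \\ I^z{}_y \end{pmatrix} + \omega^z \begin{pmatrix} I^x{}_z \\ I^y{}_z \\ I^z{}_z \end{pmatrix}$$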


But notice that if I accepts a contravariant vector, then I’s components for that input vector must be covariant. 
However, I produces a contravariant output, so its output components are contravariant. So far, so good. 


But now we want to find the kinetic energy. Well, $KE = \frac{1}{2} I\omega^2 = \frac{1}{2} L \cdot \omega = \frac{1}{2} I(\omega, \_) \cdot \omega$. The dot product
is a dot product of two contravariant vectors. To evaluate that dot product, in a general coordinate system,
we have to use the metric:


$$KE = \frac{1}{2} I^i{}_j\,\omega^j \omega_i = \frac{1}{2} I^i{}_j\,\omega^j g_{ik}\,\omega^k \ne \frac{1}{2} I^i{}_j\,\omega^j \omega^i.$$

However, in Cartesian coordinates, the metric matrix is the identity matrix, the contravariant components
equal the covariant components, and the final “not-equals” above becomes an “equals.” Hence, we neglect
the distinction between contravariant components and covariant components, and “incorrectly” sum the
components of I on the components of ω, even though both are contravariant in the 2nd sum.


In general coordinates, the direct sum for the dot product doesn’t work, and you must use the metric 
tensor for the final dot product. 
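
A numerical sketch of this point, under assumed values: the inertia matrix, angular velocity, and oblique basis below are all hypothetical. The kinetic energy computed with the metric in the oblique basis matches the Cartesian value, while the direct component sum does not.

```python
import numpy as np

# Hypothetical Cartesian moment of inertia and angular velocity.
inertia_cart = np.diag([2.0, 3.0, 4.0])
omega_cart = np.array([1.0, 2.0, 3.0])
ke_cart = 0.5 * omega_cart @ inertia_cart @ omega_cart

# Hypothetical non-orthonormal basis: columns are the basis vectors in Cartesian components.
basis = np.array([[1.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 2.0]])
metric = basis.T @ basis  # g_ij = e_i . e_j

# Contravariant components of omega, and mixed components I^i_j, in the new basis.
omega_up = np.linalg.solve(basis, omega_cart)
inertia_mixed = np.linalg.solve(basis, inertia_cart @ basis)

# Angular momentum L^i = I^i_j omega^j (contravariant output).
momentum_up = inertia_mixed @ omega_up

# Correct: contract L^i with omega^k through the metric g_ik.
ke_with_metric = 0.5 * momentum_up @ metric @ omega_up

# "Cartesian tensor" shortcut: sum L^i omega^i directly, ignoring the metric.
ke_direct_sum = 0.5 * momentum_up @ omega_up

print(ke_cart, ke_with_metric, ke_direct_sum)
```

With these numbers the metric contraction reproduces the Cartesian kinetic energy, and the direct sum gives a different value.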


Example of failure of finite displacements: TBS: The electric quadrupole tensor acts on two copies of 
the finite displacement vector to produce the electric potential at that displacement. Even in something as 
simple as polar coordinates, this method fails. 


The Real Reason Why the Kronecker Delta Is Symmetric 


TBS: It is a mixed tensor, $\delta^\alpha{}_\beta$, but symmetry can only be assessed by comparing interchange of two
indices of the same “up-” or “down-ness” (contravariance or covariance). We can lower, say α, in $\delta^\alpha{}_\beta$, with
the metric:


$$\delta_{\alpha\beta} = g_{\alpha\gamma}\,\delta^\gamma{}_\beta = g_{\alpha\beta}.$$


The result is the metric $g_{\alpha\beta}$, which is always symmetric. Hence, $\delta^\alpha{}_\beta$ is a symmetric tensor, but not because
its matrix looks symmetric. In general, a mixed rank-2 symmetric tensor does not have a symmetric matrix
representation. Only when both indices are up or both down is its matrix symmetric.


The Kronecker delta is a special case that does not generalize. 


Things are not always what they seem. 
Tensor Appendices 


Pythagorean Relation for 1-forms 


Demonstration that 1-forms satisfy the Pythagorean relation for magnitude: 


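[Figure: four 1-forms drawn as sets of parallel planes, each with a unit vector: 0dx + 1dy with |a~| = 1; 1dx + 1dy with |a~| = √2; 2dx + 1dy with |a~| = √5; and the generic a dx + b dy with |a~| = √(a² + b²), whose planes are spaced 1/a apart along x and 1/b apart along y.]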


Examples of three 1-forms, and a generic 1-form. Here, dx is the x basis 1-form, and dy is the y 
basis 1-form. 


From the diagram above, a max-crossing vector (perpendicular to the planes of a~) has (x, y) components 
(1/b, 1/a). Dividing by its magnitude, we get a unit vector: 


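max crossing unit vector u = [(1/b) x + (1/a) y] / √(1/b² + 1/a²). Note that dx(x) = 1, and dy(y) = 1.


The magnitude of a 1-form is the scalar resulting from the 1-form’s action on a max-crossing unit vector:

|a~| = (a dx + b dy)(u) = (a/b + b/a) / √(1/b² + 1/a²) = [(a² + b²)/(ab)] / [√(a² + b²)/(ab)] = √(a² + b²).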
Here’s another demonstration that 1-forms satisfy the Pythagorean relation for magnitude. The
magnitude of a 1-form is the inverse of the plane spacing:


[Figure: right triangle with O at the origin, A on the x axis with OA = 1/a, B on the y axis with OB = 1/b, and X the foot of the perpendicular from O to the plane AB, so that OX is the plane spacing.]

ΔOXA ~ ΔBOA  ⇒  OX/OA = BO/BA

OX = OA · BO / BA = (1/a)(1/b) / √(1/a² + 1/b²)

|a~| = 1/OX = ab √(1/a² + 1/b²) = √(a² + b²)


Figure 16.7 Demonstration that 1-forms satisfy the Pythagorean relation for magnitude.


Geometric Construction Of The Sum Of Two 1-Forms: 


[Figure panels: “Example of a~ + b~” and “Construction of a~ + b~”, with the values below.]


a~(x) = 2           a~(v_a) = 1           a~(v_b) = 0
b~(x) = 1           b~(v_a) = 0           b~(v_b) = 1
(a~ + b~)(x) = 3    (a~ + b~)(v_a) = 1    (a~ + b~)(v_b) = 1


Figure 16.8 Geometric construction of the sum of two 1-forms. 


To construct the sum of two 1-forms, a~ + b~: 


1. Choose an origin at the intersection of a plane of a~ and a plane of b~.

2. Draw vector v_a from the origin along the planes of b~, so b~(v_a) = 0, and of length such that
a~(v_a) = 1. [This is the dual vector of a~.]

3. Similarly, draw v_b from the origin along the planes of a~, so a~(v_b) = 0, and b~(v_b) = 1. [This is the
dual vector of b~.]

4. Draw a plane through the heads of v_a and v_b (black above). This defines the orientation of (a~ + b~).

5. Draw a parallel plane through the common point (the origin). This defines the spacing of planes of
(a~ + b~).

6. Draw all other planes parallel, and with the same spacing. This is the geometric representation of
(a~ + b~).


Now we can easily draw the test vector x, such that a~(x) = 2, and b~(x) = 1. 


“Fully Anti-symmetric” Symbols Expanded 


Everyone hears about them, but few ever see them. They are quite sparse: the 3-D fully anti-symmetric
symbol has 6 nonzero values out of 27; the 4-D one has 24 nonzero values out of 256.
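
These counts are easy to confirm by brute force. The sketch below builds the symbol from permutation parity, using 0-based indices throughout; the function name `levi_civita` is just an illustrative choice.

```python
from itertools import permutations

import numpy as np

def levi_civita(dimension):
    """Build the fully anti-symmetric symbol as a dense integer array (0-based indices)."""
    symbol = np.zeros((dimension,) * dimension, dtype=int)
    for perm in permutations(range(dimension)):
        # Parity of the permutation = parity of its number of inversions.
        inversions = sum(
            1
            for i in range(dimension)
            for j in range(i + 1, dimension)
            if perm[i] > perm[j]
        )
        symbol[perm] = -1 if inversions % 2 else 1
    return symbol

eps3 = levi_civita(3)
eps4 = levi_civita(4)

print(np.count_nonzero(eps3), eps3.size)  # nonzero entries, total entries in 3-D
print(np.count_nonzero(eps4), eps4.size)  # nonzero entries, total entries in 4-D
```

Individual entries can be read off directly, e.g. `eps4[0, 1, 3, 2]` for the permutation 0132.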


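3-D, from the 6 permutations, ijk: 123+, 132-, 312+, 321-, 231+, 213-

For k = 1, k = 2, and k = 3 respectively (rows indexed by i, columns by j):

$$\varepsilon_{ijk} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & -1 & 0 \end{pmatrix}, \begin{pmatrix} 0 & 0 & -1 \\ 0 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}, \begin{pmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$

4-D, from the 24 permutations, αβγδ:

0123+ 0132- 0312+ 0321- 0231+ 0213-
1023- 1032+ 1302- 1320+ 1230- 1203+
2013+ 2031- 2301+ 2310- 2130+ 2103-
3012- 3021+ 3201- 3210+ 3120- 3102+

$\varepsilon_{\alpha\beta\gamma\delta}$ as 4x4 blocks (rows indexed by γ, columns by δ), listed for β = 0, 1, 2, 3 within each α; the blocks with α = β are all zeros, written 0:

$$\alpha = 0: \quad 0, \quad \begin{pmatrix} 0&0&0&0 \\ 0&0&0&0 \\ 0&0&0&1 \\ 0&0&-1&0 \end{pmatrix}, \quad \begin{pmatrix} 0&0&0&0 \\ 0&0&0&-1 \\ 0&0&0&0 \\ 0&1&0&0 \end{pmatrix}, \quad \begin{pmatrix} 0&0&0&0 \\ 0&0&1&0 \\ 0&-1&0&0 \\ 0&0&0&0 \end{pmatrix}$$

$$\alpha = 1: \quad \begin{pmatrix} 0&0&0&0 \\ 0&0&0&0 \\ 0&0&0&-1 \\ 0&0&1&0 \end{pmatrix}, \quad 0, \quad \begin{pmatrix} 0&0&0&1 \\ 0&0&0&0 \\ 0&0&0&0 \\ -1&0&0&0 \end{pmatrix}, \quad \begin{pmatrix} 0&0&-1&0 \\ 0&0&0&0 \\ 1&0&0&0 \\ 0&0&0&0 \end{pmatrix}$$

$$\alpha = 2: \quad \begin{pmatrix} 0&0&0&0 \\ 0&0&0&1 \\ 0&0&0&0 \\ 0&-1&0&0 \end{pmatrix}, \quad \begin{pmatrix} 0&0&0&-1 \\ 0&0&0&0 \\ 0&0&0&0 \\ 1&0&0&0 \end{pmatrix}, \quad 0, \quad \begin{pmatrix} 0&1&0&0 \\ -1&0&0&0 \\ 0&0&0&0 \\ 0&0&0&0 \end{pmatrix}$$

$$\alpha = 3: \quad \begin{pmatrix} 0&0&0&0 \\ 0&0&-1&0 \\ 0&1&0&0 \\ 0&0&0&0 \end{pmatrix}, \quad \begin{pmatrix} 0&0&1&0 \\ 0&0&0&0 \\ -1&0&0&0 \\ 0&0&0&0 \end{pmatrix}, \quad \begin{pmatrix} 0&-1&0&0 \\ 1&0&0&0 \\ 0&0&0&0 \\ 0&0&0&0 \end{pmatrix}, \quad 0$$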
Metric? We Don’t Need No Stinking Metric!


Examples of Useful, Non-metric Spaces 


Non-metric spaces are everywhere. A non-metric space has no concept of “distance” between arbitrary 
points, or even between arbitrary “nearby” points (points with infinitesimal coordinate differences). 
However: 


Non-metric spaces have no concept of “distance,” 
but many still have a well-defined concept of “area,” in the sense of an integral. 


For example, consider a plot of velocity (of a particle in 1D) vs. time (below, left). 


[Figure: three plots: velocity vs. time (area = displacement), pressure vs. volume, and momentum vs. position.]


Some useful non-metric spaces: (left) velocity vs. time; (middle) pressure vs. volume; 
(right) momentum vs. position. In each case, there is no distance, but there is area. 


The area under the velocity curve is the total displacement covered. The area under the P-V curve is the work 
done by an expanding fluid. The area under the momentum-position curve (p-q) is the action of the motion 
in classical mechanics. Though the points in each of these plots exist on 2D manifolds, the two coordinates 
are incomparable (they have different units). It is meaningless to ask what is the distance between two 
arbitrary points on the plane. For example, points A and B on the v-t curve differ in both velocity and time, 
so how could we define a distance between them (how can we add m/s and seconds)? 


In the above cases, we have one coordinate value as a function of the other, e.g. velocity as a function of 
time. We now consider another case: rather than consider the function as one of the coordinates in a manifold, 
we consider the manifold as comprising only the independent variables. Then, the function is defined on that 
manifold. As usual, keeping track of the units of all the quantities will help in understanding both the physical 
and mathematical principles. 


For example, the speed of light in air is a function of 3 independent variables: temperature, pressure, and
humidity. At 633 nm, the effects amount to speed changes of about +1 ppm per kelvin, -0.4 ppm per mm-Hg
pressure, and +0.01 ppm per 1% change in relative humidity (RH) (see
http://patapsco.nist.gov/mel/div821/Wavelength/Documentation.asp#CommentsRegardingInputstotheEquations):


$$s(T, P, H) = s_0 + aT - bP + cH,$$


where $a \approx 300$ (m/s)/K, $b \approx 120$ (m/s)/mm-Hg, and $c \approx 3$ (m/s)/% are positive constants, and the function
s is the speed of light at the given conditions, in m/s. Our manifold is the set of TPH triples, and s is a
function on that manifold. We can consider the TPH triple as a (contravariant, column) vector: $(T, P, H)^T$.
These vectors constitute a 3D vector space over the field of reals. $s(\cdot)$ is a real function on that vector space.


Note that the 3 components of a vector each have different units: the temperature is measured in kelvins 
(K), the pressure in mm-Hg, and the relative humidity in %. Note also that there is no metric on (T, P, H) 
space (which is bigger, 1 K or 1 mm-Hg?). However, the gradient of s is still well defined: 

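$$\nabla s = \frac{\partial s}{\partial T}\,dT + \frac{\partial s}{\partial P}\,dP + \frac{\partial s}{\partial H}\,dH = a\,dT - b\,dP + c\,dH.$$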


What are the units of the gradient? As with the vectors, each component has different units: the first is in
(m/s) per kelvin; the second in (m/s) per mm-Hg; the third in (m/s) per %. The gradient has different units
than the vectors, and is not a part of the original vector space. The gradient, $\nabla s$, operates on a vector
$(T, P, H)^T$ to give the change in speed from one set of conditions, say $(T_0, P_0, H_0)$, to conditions incremented by the
vector, $(T_0 + T, P_0 + P, H_0 + H)$.

One often thinks of the gradient as having a second property: it specifies the “direction” of steepest
increase of the function, s. But:


Without a metric, “steepest” is not defined. 


Which is steeper, moving one unit in the temperature direction, or one unit in the humidity direction? In 
desperation, we might ignore our units of measure, and choose the Euclidean metric (thus equating one unit 
of temperature with one unit of pressure and one unit of humidity); then the gradient produces a “direction” 
of steepest increase. However, with no justification for such a choice of metric, the result is probably 
meaningless. 


What about basis vectors? The obvious choice is, including units, (1 K, 0 mm-Hg, 0 %)^T, (0 K, 1 mm-Hg,
0 %)^T, and (0 K, 0 mm-Hg, 1 %)^T, or omitting units: (1, 0, 0), (0, 1, 0), and (0, 0, 1). Note that these are
not unit vectors, because there is no such thing as a “unit” vector, because there is no metric by which to
measure one “unit.” Also, if I ascribe units to the basis vectors, then the components of an arbitrary vector
in that basis are dimensionless.


Now let’s change the basis: suppose now I measure temperature in some unit equal to ½ K (almost the
Rankine scale). Now all my temperature measurements “double”, i.e. $T_{new} = 2\,T_{old}$. In other words, (½ K, 0,
0)^T is a different basis than (1 K, 0, 0)^T. As expected for a covariant component, the temperature component
of the gradient, $(\nabla s)_T$, is cut in half if the basis vector “halves.” So when the half-size gradient component
operates on the double-size temperature vector component, the product remains invariant, i.e., the speed of
light is a function of temperature, not of the units in which you measure temperature.
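
A tiny arithmetic sketch of this invariance, using the approximate coefficient a ≈ 300 (m/s)/K quoted earlier and a hypothetical 2 K temperature change:

```python
# Approximate temperature coefficient from the linear model above, in (m/s)/K.
gradient_component_per_kelvin = 300.0

# Hypothetical temperature change, expressed in kelvins.
delta_temperature_kelvin = 2.0

# New temperature unit equal to 1/2 K: the basis vector halves ...
basis_scale = 0.5

# ... so the contravariant component doubles, and the covariant component halves.
delta_temperature_new_units = delta_temperature_kelvin / basis_scale
gradient_component_new_units = gradient_component_per_kelvin * basis_scale

change_in_speed_old = gradient_component_per_kelvin * delta_temperature_kelvin
change_in_speed_new = gradient_component_new_units * delta_temperature_new_units

print(change_in_speed_old, change_in_speed_new)  # both in m/s
```

Both products give the same change in speed, 600 m/s for these assumed numbers.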


The above basis change was a simple change of scale of one component in isolation. The other common 
basis change is a “rotation” of the axes, “mixing” the old basis vectors. 


Can we rotate axes when the units are different for each component? Surprisingly, we can. 


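[Figure: the (T, P, H) axes, redrawn with new basis vectors e_1, e_2, e_3 formed as linear combinations of the old ones.]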


We simply define new basis vectors as linear combinations of old ones, which is all that a rotation does.
For example, suppose we measured the speed of light on 3 different days, and the environmental conditions
were different on those 3 days. We choose those measurements as our basis, say $e_1$ = (300 K, 750 mm-Hg,
20 %), $e_2$ = (290 K, 760 mm-Hg, 30 %), and $e_3$ = (290 K, 770 mm-Hg, 10 %). These basis vectors are not
orthogonal, but are (of course) linearly independent. Suppose I want to know the speed of light at (296 K,
752 mm-Hg, 28 %). I decompose this into my new basis and get (0.6, 0.6, −0.2). I compute the speed of
light function in the new basis, and then compute its gradient, to get $d_1\omega^1 + d_2\omega^2 + d_3\omega^3$. I then operate on
the vector with the gradient to find the change in speed: $\Delta s = \nabla s(0.6, 0.6, -0.2) = 0.6\,d_1 + 0.6\,d_2 - 0.2\,d_3$.
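
The decomposition is just a linear solve. In the sketch below, the three basis vectors are the ones quoted above; with them, the components (0.6, 0.6, −0.2) correspond to the point (296 K, 752 mm-Hg, 28 %), so that is the target used here. The gradient components (a, −b, c) are the approximate values from the linear model and are an assumption for illustration.

```python
import numpy as np

# Basis vectors (T in K, P in mm-Hg, H in %), stacked as columns.
e1 = np.array([300.0, 750.0, 20.0])
e2 = np.array([290.0, 760.0, 30.0])
e3 = np.array([290.0, 770.0, 10.0])
basis = np.column_stack([e1, e2, e3])

# Point to decompose; its components in the new basis solve basis @ components = target.
target = np.array([296.0, 752.0, 28.0])
components = np.linalg.solve(basis, target)

# Gradient components in the old basis, assumed from the linear model: (a, -b, c).
gradient_old = np.array([300.0, -120.0, 3.0])

# Covariant components transform with the basis vectors: d_k = (grad s)_j (e_k)^j.
gradient_new = gradient_old @ basis

# The gradient acting on the vector gives the same number in either basis.
delta_s_old = gradient_old @ target
delta_s_new = gradient_new @ components

print(components, delta_s_old, delta_s_new)
```

No metric is used anywhere: the solve and the contraction need only linear independence of the basis.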


We could extend this to a more complex function, and then the gradient is not constant. For example, a 
more accurate equation for the speed of light is: 


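$$s(T, P, H) = c\left[1 - f\,\frac{P}{T} + g\,H\left((T - 273)^2 + 160\right)\right],$$


where c is the speed of light in vacuum, and $f \approx 7.86 \times 10^{-4}$ and $g \approx 1.5 \times 10^{-11}$ are constants. Now the gradient is a function of position (in TPH
space), and there is still no metric.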


Comment on the metric: In desperation, you might define a metric, i.e. the length of a vector, to be As, 
the change in the speed of light due to the environmental changes defined by that vector. However, such a 
metric is in general non-Euclidean (not a Pythagorean relationship), indefinite (non-zero vectors can have 
zero or negative “lengths”), and still doesn’t define a meaningful dot product. Our more-accurate equation 
for the speed of light provides examples of these failures. 


4/27/2021 11:49AM — Copyright 2002-2021 Eric L. Michelsen. All rights reserved. 296 of 322 


elmichelsen.physics.ucsd.edu/ | Funky Mathematical Physics Concepts emichels at physics.ucsd.edu 


References: 

[Knu] Knuth, Donald, The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, 2nd Ed., p. 117.

[Mic] Michelsen, Eric L., Quirky Quantum Concepts, Springer, 2014. ISBN-13: 978-1461493044.

[Sch] Schutz, Bernard, A First Course in General Relativity, Cambridge University Press, 1990.

[Sch2] Schutz, Bernard, Geometrical Methods of Mathematical Physics, Cambridge University Press, 1980.

[Tal] Talman, Richard, Geometric Mechanics, John Wiley and Sons, 2000.




17 Differential Geometry 


Manifolds 


A manifold is a “space”: a set of points with coordinate labels. We are free to choose coordinates many 
ways, but a manifold must be able to have coordinates that are real numbers. We are familiar with “metric 
manifolds”, where there is a concept of distance. However, there are many useful manifolds which have no 
metric, e.g. phase space (see “We Don’t Need No Stinking Metric” above). 


Even when a space is non-metric, it still has concepts of “locality” and “continuity.” 


Such locality and continuity are defined in terms of the coordinates, which are real numbers. It may also
have a “volume”, e.g. the oft-mentioned “phase-space volume.” It may seem odd that there’s no definition
of “distance,” but there is one of “volume.” Volume in this case is simply defined in terms of the coordinates,
$dV = dx_1\,dx_2\,dx_3 \ldots$, and has no absolute meaning.


Coordinate Bases 


Coordinate bases are basis vectors derived from coordinates on the manifold. They are extremely useful,
and built directly on basic multivariate calculus. Coordinate bases can be defined a few different ways.
Perhaps the simplest comes from considering a small displacement vector on a manifold. We use 2D polar
coordinates in (r, θ) as our example. A coordinate basis can be defined as the basis in which the components
of an infinitesimal displacement vector are just the differentials of the coordinates:


(Left) Coordinate bases: the components of the displacement vector are the differentials of the 
coordinates. (Right) Coordinate basis vectors around the manifold. 


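Note that $e_\theta$ (the θ basis vector) far from the origin must be bigger than near, because a small change in
angle, dθ, causes a bigger displacement vector far from the origin than near. The advantage of a coordinate
basis is that it makes dot products, such as a gradient dotted into a displacement, appear in the simplest
possible form:


$$\text{Given } f(r, \theta): \quad df = \nabla f(r, \theta) \cdot dp = \left(\frac{\partial f}{\partial r}, \frac{\partial f}{\partial \theta}\right) \cdot (dr, d\theta) = \frac{\partial f}{\partial r}\,dr + \frac{\partial f}{\partial \theta}\,d\theta.$$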


The last equality is assured from elementary multivariate calculus. 


The basis vectors are defined by differentials, but are themselves finite vectors. Any physical vector, 
finite or infinitesimal, can be expressed in the coordinate basis, e.g., velocity, which is finite. 


“Vectors” as derivatives: There is a huge confusion about writing basis “vectors” as derivatives. From
our study of tensors (earlier), we know that a vector can be considered an operator on a 1-form, which
produces a scalar. We now describe how vector fields can be considered operators on scalar functions, which
produce scalar fields. I don’t like this view, since it is fairly arbitrary, confuses the much more consistent
tensor view, and is easily replaced with tensor notation.


We will see that in fact, the derivative “basis vectors” are operators which create 1-forms (dual-basis 
components), not traditional basis vectors. The vector basis is then implicitly defined as the dual of the dual- 
basis, which is always the coordinate basis. In detail: 
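We know from the “Tensors” chapter that the gradient of a scalar field is a 1-form with partial derivatives
as its components. For example:


$$\nabla f = \frac{\partial f}{\partial x}\,\omega^1 + \frac{\partial f}{\partial y}\,\omega^2 + \frac{\partial f}{\partial z}\,\omega^3 = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}\right), \quad \text{where } \omega^1, \omega^2, \omega^3 \text{ are basis 1-forms.}$$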


Many texts define vectors in terms of their action on scalar functions (aka scalar fields), e.g. [Wald p15].
Given a point (x, y, z), and a function f(x, y, z), the definition of a vector v amounts to


$$v \equiv (v^x, v^y, v^z) \text{ such that } v[f(x, y, z)] \equiv v \cdot \nabla f = v^x \frac{\partial f}{\partial x} + v^y \frac{\partial f}{\partial y} + v^z \frac{\partial f}{\partial z} \quad \text{(a scalar field)}.$$


Roughly, the action of v on f produces a scaled directional derivative of f: Given some small displacement
dt, as a fraction of |v| and in the direction of v, v tells you how much f will change when moving from (x, y, z)
to $(x + v^x dt,\ y + v^y dt,\ z + v^z dt)$:


$$df = v[f]\,dt \quad \text{or} \quad \frac{df}{dt} = v[f].$$


If t is time, and v is a velocity, then v[f] is the time rate of change of f. While this notation is compact, I’d
rather write it simply as the dot product of v and ∇f, which is more explicit, and consistent with tensors:

$$df = v \cdot \nabla f\,dt \quad \text{or} \quad \frac{df}{dt} = v \cdot \nabla f.$$

The definition of v above requires an auxiliary function f, which is messy. We remove f by redefining v
as an operator:


$$v \equiv v^x \frac{\partial}{\partial x} + v^y \frac{\partial}{\partial y} + v^z \frac{\partial}{\partial z} \quad \text{(an operator)}.$$


Given this form, it looks like ∂/∂x, ∂/∂y, and ∂/∂z are some kind of “basis vectors.” Indeed, standard
terminology is to refer to ∂/∂x, ∂/∂y, and ∂/∂z as the “coordinate basis” for vectors, but they are really
operators for creating 1-forms! Then:


$$v[f] = v^x \frac{\partial f}{\partial x} + v^y \frac{\partial f}{\partial y} + v^z \frac{\partial f}{\partial z} = \sum_{i = x, y, z} v^i\,(\nabla f)_i \quad \text{(a scalar field)}.$$


The vector v contracts directly with the 1-form ∇f (without need of any metric), hence v is a vector implicitly
defined in the basis dual to the 1-form ∇f.


Note that if v = v(x, y, z) is a vector field, then:

$$v[f(x, y, z)] = v(x, y, z) \cdot \nabla f(x, y, z) \quad \text{(a scalar field)}.$$


These derivative operators can be drawn as basis vectors in the usual manner, as arrows on the manifold.
They are just the coordinate basis vectors shown earlier. For example, consider polar coordinates (r, θ):

Figure 17.1 Examples of coordinate basis vectors around the manifold. $e_r$ happens to be unit
magnitude everywhere, but $e_\theta$ is not.


The manifold in this case is simply the flat plane, $\mathbb{R}^2$. The r-coordinate basis vectors are all the same size,
but have different directions at different places. The θ coordinate basis vectors get larger with r, and also
vary in direction around the manifold.
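
As a quick numerical illustration, here is a sketch that builds the polar coordinate basis vectors as partial derivatives of the Cartesian position (x, y) = (r cos θ, r sin θ). The function name and sample points are illustrative choices.

```python
import numpy as np

def polar_coordinate_basis(r, theta):
    """Return (e_r, e_theta) in Cartesian components at the point (r, theta)."""
    e_r = np.array([np.cos(theta), np.sin(theta)])               # d(x, y)/dr
    e_theta = np.array([-r * np.sin(theta), r * np.cos(theta)])  # d(x, y)/dtheta
    return e_r, e_theta

for radius in (0.5, 1.0, 3.0):
    e_r, e_theta = polar_coordinate_basis(radius, theta=0.7)
    # |e_r| stays 1, while |e_theta| equals the radius.
    print(radius, np.linalg.norm(e_r), np.linalg.norm(e_theta))
```

The printed magnitudes show e_r staying at unit length while e_theta grows in proportion to r.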

Covariant Derivatives 


Notation: Due to word-processor limitations, the following two notations are equivalent: 
h()=h(), roe 
The following description is similar to one in [Sch]. 


We start with the familiar concepts of derivatives, and see how that evolves into the covariant derivative. 
Given a real-valued function of one variable, f(x), we want to know how f varies with x near a value, a. The 
answer is the derivative of f(x), where 


$$df = f'(a)\,dx \quad \text{and therefore} \quad f(a + dx) = f(a) + df = f(a) + f'(a)\,dx.$$


Extending to two variables, g(x, y), we'd like to know how g varies in the 2-D neighborhood around a 
point (a, b), given a displacement vector dr = (dx, dy). We can compute its gradient: 


$$\nabla g = \frac{\partial g}{\partial x}\,dx + \frac{\partial g}{\partial y}\,dy \quad \text{and therefore} \quad g(a + dx, b + dy) = g(a, b) + \nabla g(dr).$$


The gradient is also called a directional derivative, because the rate at which g changes depends on the 
direction in which you move away from the point (a, b). 


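The gradient extends to a vector valued function (a vector field) $h(x, y) = h^x(x, y)\,i + h^y(x, y)\,j$:


$$\nabla h = \frac{\partial h}{\partial x}\,dx + \frac{\partial h}{\partial y}\,dy, \quad \text{where} \quad \frac{\partial h}{\partial x} = \frac{\partial h^x}{\partial x}\,e_x + \frac{\partial h^y}{\partial x}\,e_y, \quad \text{and} \quad \frac{\partial h}{\partial y} = \frac{\partial h^x}{\partial y}\,e_x + \frac{\partial h^y}{\partial y}\,e_y,$$

$$\nabla h = \begin{pmatrix} \dfrac{\partial h^x}{\partial x} & \dfrac{\partial h^x}{\partial y} \\ \dfrac{\partial h^y}{\partial x} & \dfrac{\partial h^y}{\partial y} \end{pmatrix} \;\Rightarrow\; \nabla h(dr) = \nabla h \begin{pmatrix} dx \\ dy \end{pmatrix} = dx \begin{pmatrix} \dfrac{\partial h^x}{\partial x} \\ \dfrac{\partial h^y}{\partial x} \end{pmatrix} + dy \begin{pmatrix} \dfrac{\partial h^x}{\partial y} \\ \dfrac{\partial h^y}{\partial y} \end{pmatrix}.$$


We see that the columns of ∇h are vectors which are weighted by dx and dy, and then summed to produce a
vector result. Therefore, ∇h is linear in the displacement vector dr = (dx, dy). This linearity ensures that it
transforms like a duck . . . I mean, like a tensor. Thus ∇h is a rank-2 $\binom{1}{1}$ tensor: it takes a single vector input,
and produces a vector result.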
So far, all this has been in rectangular coordinates. Now we must consider what happens in curvilinear
coordinates, such as polar. Note that we’re still in a simple, flat space. (We’ll get to curved spaces later).
Our goal is still to find the change in the vector value of h( ), given an infinitesimal vector change of position,
$dx = (dx^1, dx^2)$. We use the same approach as above, where a vector valued function comprises two (or n)
real-valued component functions: $h(x^1, x^2) = h^1(x^1, x^2)\,e_1 + h^2(x^1, x^2)\,e_2$. However, in this general case, the
basis vectors are themselves functions of position (previously the basis vectors were constant everywhere).
So h( ) is really:


$$h(x^1, x^2) = h^1(x^1, x^2)\,e_1(x^1, x^2) + h^2(x^1, x^2)\,e_2(x^1, x^2).$$


Hence, partial derivatives of the component functions alone are no longer sufficient to define the change in
the vector value of h( ); we must also account for the change in the basis vectors.


[Figure: the vector field h and basis vectors $e_1$, $e_2$ at the points $(x^1, x^2)$ and $(x^1 + dx^1, x^2 + dx^2)$, separated by $dx = (dx^1, dx^2)$. Panel (a): components constant, but vector changes. Panel (b): vector constant, but components change.]


Figure 17.2 The distinction between a component of a derivative, and a derivative of a component. 


Note that a component of the derivative is distinctly not the same as the derivative of the component (see 
Figure 17.2). Therefore, the ith component of the derivative depends on all the components of the vector 
field. 


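We compute partial derivatives of the vector field $h(x^1, x^2)$ using the product rule:


$$\frac{\partial h}{\partial x^1} = \frac{\partial h^1}{\partial x^1}\,e_1(x^1, x^2) + h^1(x^1, x^2)\,\frac{\partial e_1}{\partial x^1} + \frac{\partial h^2}{\partial x^1}\,e_2(x^1, x^2) + h^2(x^1, x^2)\,\frac{\partial e_2}{\partial x^1} = \sum_{j=1}^{n} \left( \frac{\partial h^j}{\partial x^1}\,e_j + h^j\,\frac{\partial e_j}{\partial x^1} \right).$$


This is a vector equation: all terms are vectors, each with components in all n basis directions. This is
equivalent to n numerical component equations. Note that $(\partial h/\partial x^1)$ has components in both (or all n)
directions. Of course, we can write similar equations for the components of the derivative in any basis
direction, $e_k$:

$$\frac{\partial h}{\partial x^k} = \frac{\partial h^1}{\partial x^k}\,e_1(x^1, x^2) + h^1(x^1, x^2)\,\frac{\partial e_1}{\partial x^k} + \frac{\partial h^2}{\partial x^k}\,e_2(x^1, x^2) + h^2(x^1, x^2)\,\frac{\partial e_2}{\partial x^k} = \sum_{j=1}^{n} \left( \frac{\partial h^j}{\partial x^k}\,e_j(x^1, x^2) + h^j(x^1, x^2)\,\frac{\partial e_j}{\partial x^k} \right). \qquad (1)$$


Because we must frequently work with components and component equations, rather than whole vector
equations, let us now consider only the ith component of the above:


$$\left( \frac{\partial h}{\partial x^k} \right)^i = \frac{\partial h^i}{\partial x^k} + \sum_{j=1}^{n} h^j \left( \frac{\partial e_j}{\partial x^k} \right)^i. \qquad (2)$$


The first term moves out of the summation because each of the first terms in the summation of eq. (1) is a
vector, and each points exactly in the $e_j$ direction. Only the j = i term contributes to the ith component; the
purely $e_j$-directed vector contributes nothing to the ith component when j ≠ i.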
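Recall that these equations are true for any arbitrary coordinate system; we have made no assumptions
about unit length or orthogonal basis vectors. Note that


$$\frac{\partial h}{\partial x^k} \equiv (\nabla h)_k = \text{the } k\text{th (covariant) component of } \nabla h.$$


Since ∇h is a rank-2 tensor, the kth covariant component of ∇h is the kth column of ∇h:

$$\nabla h = \begin{pmatrix} \left( \dfrac{\partial h}{\partial x^1} \right)^1 & \left( \dfrac{\partial h}{\partial x^2} \right)^1 \\ \left( \dfrac{\partial h}{\partial x^1} \right)^2 & \left( \dfrac{\partial h}{\partial x^2} \right)^2 \end{pmatrix}.$$


Since the change in h( ) is linear with small changes in position,

$$dh = \nabla h(dx), \quad \text{where} \quad dx \equiv (dx^1, dx^2).$$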


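Going back to Equations (1) and (2), we can now write the full covariant derivative of h( ) in 3 ways:
vector, verbose component, and compact component:


$$(\nabla h)_k \equiv \nabla_k h = \sum_{j=1}^{n} \left( \frac{\partial h^j}{\partial x^k}\,e_j + h^j(x^1, x^2)\,\frac{\partial e_j}{\partial x^k} \right) = \sum_{j=1}^{n} \left( \frac{\partial h^j}{\partial x^k}\,e_j + h^j(x^1, x^2) \sum_{i=1}^{n} \Gamma^i{}_{jk}\,e_i \right)$$

$$(\nabla_k h)^i = \frac{\partial h^i}{\partial x^k} + \sum_{j=1}^{n} h^j(x^1, x^2) \left( \frac{\partial e_j}{\partial x^k} \right)^i$$

$$(\nabla_k h)^i = \frac{\partial h^i}{\partial x^k} + h^j\,\Gamma^i{}_{jk}, \quad \text{where} \quad \Gamma^i{}_{jk} \equiv \left( \frac{\partial e_j}{\partial x^k} \right)^i \;\Rightarrow\; \Gamma^i{}_{jk}\,e_i = \frac{\partial e_j}{\partial x^k}.$$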


Aside: Some mathematicians complain that you can’t define the Christoffel symbols as derivatives of basis 
vectors, because you can’t compare vectors from two different points of a manifold without already having the 
Christoffel symbols (aka the “connection”). Physicists, including Schutz [Sch], say that physics defines how to 
compare vectors at different points of a manifold, and thus you can calculate the Christoffel symbols. In the end, it 
doesn’t really matter. Either way, by physics or by fiat, the Christoffel symbols are, in fact, the derivatives of the 
basis vectors. 


Christoffel Symbols 


Christoffel symbols are the covariant derivatives of the basis vector fields. We use ordinary plane polar
coordinates (r, θ) as an example.


Figure 17.3 (a) Derivative of e_r in the r direction is the zero vector (de_r = 0). (b) Derivative of e_r in
the θ direction is θ̂ (de_r = dθ θ̂). (c) Derivative of e_θ in the r direction is the zero vector (de_θ = 0).
(d) Derivative of e_θ in the θ direction is −r̂ (de_θ = −dθ r̂).


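Figure 17.3 shows the derivatives of the r basis vector in the r direction, and in the θ direction. From
this, we can fill in four components of the Christoffel symbols:

$$\frac{\partial \mathbf{e}_r}{\partial r} = \mathbf{0} \;\Rightarrow\; \Gamma^r_{rr} = 0,\ \Gamma^\theta_{rr} = 0; \qquad \frac{\partial \mathbf{e}_r}{\partial \theta} = \hat{\boldsymbol\theta} \;\Rightarrow\; \Gamma^r_{r\theta} = 0,\ \Gamma^\theta_{r\theta} = 1.$$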


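Similarly, the derivatives of the θ basis vector are:


$$\frac{\partial \mathbf{e}_\theta}{\partial r} = \mathbf{0} \;\Rightarrow\; \Gamma^r_{\theta r} = 0,\ \Gamma^\theta_{\theta r} = 0; \qquad \frac{\partial \mathbf{e}_\theta}{\partial \theta} = -\hat{\mathbf{r}} \;\Rightarrow\; \Gamma^r_{\theta\theta} = -1,\ \Gamma^\theta_{\theta\theta} = 0.$$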


These are the 8 components of the Christoffel symbols. In general, in n-dimensional space, each basis vector
has a derivative in each direction, with n components, for a total of n³ components in $\Gamma^i_{jk}$.


Visualization of n-Forms 
TBS: 1-forms as oriented planes 
2-forms (in 3 or more space) as oriented parallelograms 
3-forms (in 3 or more space) as oriented parallelepipeds 


4-forms (in 4-space): how are they oriented?? 


Review of Wedge Products and Exterior Derivative 


This is a quick insert that needs proper work. ?? This section requires understanding outer-products, and 
anti-symmetrization of matrices. 


Wedge Products 


We can get an overview of the meaning of a wedge product from a simple example: the wedge product 
of two vectors in 3D space. We first review two preliminaries: anti-symmetrization of a matrix, and the outer 
product of two vectors. 


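Recall that any matrix can be written as a sum of a symmetric and an anti-symmetric matrix (much like
any function can be written as a sum of an even and an odd function):


$$\mathbf{B} = \mathbf{B}_S + \mathbf{B}_A, \quad \text{where} \quad \mathbf{B}_S = \mathbf{B}_S^T, \quad \mathbf{B}_A = -\mathbf{B}_A^T. \qquad (17.1)$$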


For example: 


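$$\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix} = \begin{pmatrix} 1 & 3 & 5 \\ 3 & 5 & 7 \\ 5 & 7 & 9 \end{pmatrix} + \begin{pmatrix} 0 & -1 & -2 \\ 1 & 0 & -1 \\ 2 & 1 & 0 \end{pmatrix}.$$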


We can derive explicit expressions for the symmetric and anti-symmetric parts of a matrix from (17.1): 


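$$\mathbf{B} + \mathbf{B}^T = \mathbf{B}_S + \mathbf{B}_A + \mathbf{B}_S^T + \mathbf{B}_A^T = 2\mathbf{B}_S \quad\Rightarrow\quad \mathbf{B}_S = \tfrac{1}{2}\left(\mathbf{B} + \mathbf{B}^T\right)$$

$$\text{Similarly:} \quad \mathbf{B}_A = \tfrac{1}{2}\left(\mathbf{B} - \mathbf{B}^T\right). \qquad (17.2)$$

Also recall that the outer product of two vectors is a matrix (in this case, a rank-2 tensor):

$$\mathbf{a} \otimes \mathbf{b} = \mathbf{a}\,\mathbf{b}^T = \begin{pmatrix} a_1 b_1 & a_1 b_2 & a_1 b_3 \\ a_2 b_1 & a_2 b_2 & a_2 b_3 \\ a_3 b_1 & a_3 b_2 & a_3 b_3 \end{pmatrix}. \quad \text{E.g.,} \quad \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} \otimes \begin{pmatrix} 4 \\ 5 \\ 6 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} \begin{pmatrix} 4 & 5 & 6 \end{pmatrix} = \begin{pmatrix} 4 & 5 & 6 \\ 8 & 10 & 12 \\ 12 & 15 & 18 \end{pmatrix}.$$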


Finally, the wedge product of two vectors is the anti-symmetric part of the outer product:

$$\mathbf{a} \wedge \mathbf{b} \equiv \tfrac{1}{2}\left[\mathbf{a} \otimes \mathbf{b} - (\mathbf{a} \otimes \mathbf{b})^T\right].$$


To simplify our notation, we can define a linear operator on a matrix which takes the anti-symmetric part. 
This is the anti-symmetrization operator: 


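$$A(\mathbf{B}) \equiv \tfrac{1}{2}\left(\mathbf{B} - \mathbf{B}^T\right) \quad\Rightarrow\quad \mathbf{a} \wedge \mathbf{b} = A(\mathbf{a} \otimes \mathbf{b}).$$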


Commutation: A crucial property of the wedge product is that it is anti-commutative: 


$$\mathbf{a} \wedge \mathbf{b} = -\,\mathbf{b} \wedge \mathbf{a}.$$


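This follows directly from the fact that the outer product is not commutative: $\mathbf{b} \otimes \mathbf{a} = (\mathbf{a} \otimes \mathbf{b})^T$. Then the
anti-symmetric part of a transposed matrix is the negative of the anti-symmetric part of the original matrix:


$$\mathbf{b} \wedge \mathbf{a} = A(\mathbf{b} \otimes \mathbf{a}) = A\left[(\mathbf{a} \otimes \mathbf{b})^T\right] = -A(\mathbf{a} \otimes \mathbf{b}) = -\,\mathbf{a} \wedge \mathbf{b}.$$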
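
As a quick numerical check of the definitions above, here is a minimal Python sketch. It is our own illustration, not part of the derivation: the helper names outer, antisymmetrize, and wedge are hypothetical, vectors are plain lists, and matrices are lists of rows.

```python
def outer(a, b):
    """Outer product a (x) b: the matrix with entries a[i] * b[j]."""
    return [[ai * bj for bj in b] for ai in a]


def transpose(m):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def antisymmetrize(m):
    """Anti-symmetric part of a square matrix: A(B) = (B - B^T) / 2."""
    mt = transpose(m)
    n = len(m)
    return [[(m[i][j] - mt[i][j]) / 2 for j in range(n)] for i in range(n)]


def wedge(a, b):
    """Wedge product of two vectors: anti-symmetric part of the outer product."""
    return antisymmetrize(outer(a, b))


a, b = [1, 2, 3], [4, 5, 6]
a_wedge_b = wedge(a, b)
b_wedge_a = wedge(b, a)

for row in a_wedge_b:
    print(row)
```

With a = (1, 2, 3) and b = (4, 5, 6), outer(a, b) reproduces the example matrix given earlier, and every entry of b_wedge_a is the negative of the corresponding entry of a_wedge_b.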


Tensor Notation 


In tensor notation, the symmetric and anti-symmetric parts of a matrix are written: 
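$$B_S^{\alpha\beta} = \tfrac{1}{2}\left(B^{\alpha\beta} + B^{\beta\alpha}\right), \qquad B_A^{\alpha\beta} = \tfrac{1}{2}\left(B^{\alpha\beta} - B^{\beta\alpha}\right).$$


Note that both α and β are free indexes, so (in a 3 dimensional space) each of these is 9 separate equations.
They are fully equivalent to the matrix equations (17.2).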


1D 


I don’t know of any meaning for a wedge-product in 1D, where a “vector” is just a signed number. The 
“direction” of a vector is either + or —. Also, the 1D exterior derivative is a degenerate case, because the 
“exterior” of a line segment is just the 2 endpoints, and all functions are scalar functions. In all higher 
dimensions, the “exterior” or boundary of a region is a closed path/ surface/ volume/ hyper-volume/etc. In 
1D the boundary of a line segment cannot be closed. So instead of integrating around a closed exterior (aka 
boundary), we simply take the difference in the function value at the endpoints, divided by a differential 
displacement. This is simply the ordinary derivative of a function, f (x). 


2D 


The exterior derivative of a scalar function f(x, y) follows the 1D case, and is similarly degenerate, where 
the “exterior” is simply the two endpoints of a differential displacement. Since the domain is a 2D space, the 
displacements are vectors, and there are 2 (partial) derivatives, one for displacements in x, and one for 
displacements in y. Hence the exterior derivative is just the one-form “gradient” of the function: 


$$df(x, y) = \text{“gradient”} = \frac{\partial f}{\partial x}\,dx + \frac{\partial f}{\partial y}\,dy$$


In 2D, the wedge product of the basis 1-forms, dx ∧ dy, is a two-form, which accepts two vectors to
produce the signed area of the parallelogram defined by them. A signed area can be + or −; a
counter-clockwise direction is positive, and clockwise is negative.

$$dx \wedge dy(\vec{v}, \vec{w}) = \text{signed area defined by } (\vec{v}, \vec{w}) = -\,dx \wedge dy(\vec{w}, \vec{v})$$

$$= dx(\vec{v})\,dy(\vec{w}) - dy(\vec{v})\,dx(\vec{w}) = \det\begin{pmatrix} dx(\vec{v}) & dx(\vec{w}) \\ dy(\vec{v}) & dy(\vec{w}) \end{pmatrix} = \det\begin{pmatrix} v^x & w^x \\ v^y & w^y \end{pmatrix}$$


The exterior derivative of a 1-form is the ratio of the closed path integral of the 1-form to the area of the 
parallelogram of two vectors, for infinitesimal vectors. This is very similar to the definition of curl, only 
applied to a 1-form field instead of a vector field. 


Figure 17.4 (a) 2D closed-path integral: contributions from x displacements. (b) Contributions
from y displacements. (c) Path integrals from adjacent areas of any shape add.


Consider the horizontal and vertical contributions to the path integral separately: 


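$$\omega(x, y) = \omega_x(x, y)\,dx + \omega_y(x, y)\,dy, \qquad \mathbf{r} \equiv (x, y), \quad d\mathbf{r} = (dx, dy)$$

$$\int_1 \omega(d\mathbf{r}) + \int_3 \omega(d\mathbf{r}) = \omega_x(x, y)\,dx - \omega_x(x, y + dy)\,dx = -\frac{\partial \omega_x}{\partial y}\,dy\,dx$$

$$\int_2 \omega(d\mathbf{r}) + \int_4 \omega(d\mathbf{r}) = \omega_y(x + dx, y)\,dy - \omega_y(x, y)\,dy = \frac{\partial \omega_y}{\partial x}\,dx\,dy$$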


The horizontal integrals (sides 1 & 3) are linear in dx, because that is the length of the path. They are linear
in dy, because dy is proportional to the difference in ω_x. Hence, the contribution is linear in both dx and dy,
and therefore proportional to the area (dx)(dy).


A similar argument holds for the vertical contributions, sides 2 & 4. Therefore, the path integral varies 
proportionately to the area enclosed by two orthogonal vectors. 


It is easy to show this is true for any two vectors, and any shaped area bounded by an infinitesimal path. 
For example, when you butt up two rectangles, the path integral around the combined boundary equals the 
sum of the individual path integrals, because the contributions from the common segment cancel from each 
rectangle, and hence omitting them does not change the path integral. The area integrals add. 
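
The proportionality to area can be checked numerically. The sketch below is an illustration under our own assumptions: the 1-form components omega_x and omega_y are arbitrary sample functions, chosen so that ∂ω_y/∂x − ∂ω_x/∂y = 3 everywhere, and loop_integral is a hypothetical helper that sums the four side contributions of a small rectangle with corner (x, y).

```python
def omega_x(x, y):
    """Sample x-component of the 1-form (an arbitrary choice for illustration)."""
    return x * y**2


def omega_y(x, y):
    """Sample y-component; with omega_x above, d(omega_y)/dx - d(omega_x)/dy = 3."""
    return x**2 * y + 3 * x


def loop_integral(x, y, dx, dy):
    """Counter-clockwise path integral around the rectangle, to first order."""
    # Sides 1 and 3: horizontal displacements, at y and at y + dy.
    horizontal = omega_x(x, y) * dx - omega_x(x, y + dy) * dx
    # Sides 2 and 4: vertical displacements, at x + dx and at x.
    vertical = omega_y(x + dx, y) * dy - omega_y(x, y) * dy
    return horizontal + vertical


x, y, dx, dy = 1.0, 2.0, 1e-4, 1e-4
ratio = loop_integral(x, y, dx, dy) / (dx * dy)
print(ratio)  # close to 3 for small dx, dy
```

The ratio of the path integral to the enclosed area approaches 3 as dx and dy shrink, independent of the starting point (x, y) for this particular sample 1-form.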


3D 
In 3D, the wedge product of the basis 1-forms is a 3-form, that can either: 


1. Accept 2 vectors to produce an oriented area; it doesn’t have a sign, it has a direction. Analogous 
to the cross-product. Or, 


2. Accept 3 vectors (u, v, w below) to produce a signed volume. 
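$$dx \wedge dy \wedge dz(\vec{u}, \vec{v}, \vec{w}) = \text{signed volume defined by } (\vec{u}, \vec{v}, \vec{w})$$

$$= \det\begin{pmatrix} dx(\vec{u}) & dx(\vec{v}) & dx(\vec{w}) \\ dy(\vec{u}) & dy(\vec{v}) & dy(\vec{w}) \\ dz(\vec{u}) & dz(\vec{v}) & dz(\vec{w}) \end{pmatrix} = \det\begin{pmatrix} u^x & v^x & w^x \\ u^y & v^y & w^y \\ u^z & v^z & w^z \end{pmatrix}.$$


Being a 3-form (all wedge products are p-forms), the wedge-product is anti-symmetric in its arguments:

$$dx \wedge dy \wedge dz(\vec{u}, \vec{v}, \vec{w}) = -\,dx \wedge dy \wedge dz(\vec{u}, \vec{w}, \vec{v}), \text{ etc.}$$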


The exterior derivative of a scalar or 1-form field is essentially the same as in the 2D case, except that 
now the areas defined by vectors are oriented instead of simply signed. In this case, the “exterior” is a closed 
surface; the “interior” is a volume. 
18 Math That’s Fun and Interesting


“The first time we use a particular mathematical process, we call it a ‘trick’. The second time, it’s a 
‘device’. The third time, it’s a ‘method’.” — Unknown (to me). 


Here are some math “methods” that either come up a lot and are worth knowing about, or are just fun 
and interesting. 


Math Tricks That Come Up A Lot 
The Gaussian Integral 


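You can look this up anywhere, but here goes: we evaluate the basic integral, $\int_{-\infty}^{\infty} e^{-x^2}\,dx$, and throw
in an ‘a’ with x² at the end by a simple change of variable. First, we square the integral, then rewrite the
second factor calling the dummy integration variable y instead of x:


$$\left(\int_{-\infty}^{\infty} dx\, e^{-x^2}\right)^2 = \left(\int_{-\infty}^{\infty} dx\, e^{-x^2}\right)\left(\int_{-\infty}^{\infty} dy\, e^{-y^2}\right) = \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy\, e^{-(x^2 + y^2)}$$


This is just a double integral over the entire x-y plane, so we can switch to polar coordinates. Note that the
exponential integrand is constant at constant r, so we can replace the differential area dx dy with 2πr dr:


$$\text{Let } r^2 = x^2 + y^2 \quad\Rightarrow\quad \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy\, e^{-(x^2 + y^2)} = \int_0^{\infty} dr\, 2\pi r\, e^{-r^2} = -\pi\left[e^{-r^2}\right]_0^{\infty} = \pi.$$


$$\left(\int_{-\infty}^{\infty} dx\, e^{-x^2}\right)^2 = \pi \quad\Rightarrow\quad \int_{-\infty}^{\infty} dx\, e^{-x^2} = \sqrt{\pi}, \quad \text{and} \quad \int_{-\infty}^{\infty} dx\, e^{-a x^2} = \sqrt{\frac{\pi}{a}}.$$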


Math Tricks That Are Fun and Interesting 


Continuous Infinite Crossings 


The following function has an infinite number of zero crossings near the origin, but is everywhere 
continuous (even at x = 0). That seems bizarre to me. Recall the definition: 


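$$f(x) \text{ is continuous at } a \text{ iff } \lim_{x \to a} f(x) = f(a).$$


Then let

$$f(x) = \begin{cases} x \sin\left(\dfrac{1}{x}\right), & x \ne 0 \\ 0, & x = 0 \end{cases}$$

$$\lim_{x \to 0} f(x) = 0 = f(0) \quad\Rightarrow\quad f(x) \text{ is continuous at } x = 0!$$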
Picture 
Technique for Integration


$$\int \frac{dx}{\sin x} = \text{TBS}.$$


Phasors 


Phasors are complex numbers that represent sinusoids. The phasor defines the magnitude and phase of 
the sinusoid, but not its frequency. See Funky Electromagnetic Concepts for a full description. 


Miserables Coset 


Groups are the fundamental structures of symmetries, and are crucial to advanced physics. Groups also 
underlie discrete algebra, needed for encryption and error-control coding. One of the most astounding results 
of finite group theory is that the size of any subgroup divides evenly into the size of the parent group. The 
short proof illustrates important and useful concepts in group theory, is accessible to non-mathematicians, 
and proceeds roughly as follows: 


• Define a “coset”

• Show that cosets are (a) the same size as the subgroup S, (b) don’t overlap, and (c) cover the
whole parent group G.

• Therefore, the size of the subgroup S divides the size of G.


Though the concept of a coset is straightforward, in my experience discussions of cosets are severely 
lacking (and sometimes incorrect). We here attempt to close that gap, and prove the above theorem, called 
Lagrange’s Theorem. 


Brief review of groups: A group is a set of elements, and an operator (here we call it “multiplication”),
with 4 defining properties (see our detailed definitions elsewhere):


1. Closure: the “product” of any two group elements is a group element.
2. An identity element, which we call “e”: for every element g, g·e = g (= e·g).
3. Every element g has an inverse such that $g g^{-1} = e$ (= $g^{-1} g$).
4. The group operator is associative: a(bc) = (ab)c.


These simple properties produce an enormously rich structure and theory. Note that the group operator need
not be commutative, and often is not. Nonetheless, even for noncommutative groups, g·e = e·g, and
$g g^{-1} = g^{-1} g$. (Commutative groups often write the operator as “+”, and the identity as “0”.)


An example of a group is the set {1, 2} under multiplication mod 3. The identity is “1”, and its group
operation table is below left:


      ×  |  1   2               ·   |  Lucy  Fred
      ---+--------             -----+------------
      1  |  1   2              Lucy |  Lucy  Fred
      2  |  2   1              Fred |  Fred  Lucy


A group is fully defined by its operation table. 


The group elements are abstract entities, not really numbers, though it is sometimes convenient to represent 
the elements as numbers. By simple renaming, the two tables above are equivalent, and define the same 
group. (Some groups have elements that can be thought of as operators, e.g., rotations or reflections, with 
the group-operation being the composition of the element operators.) 


Subgroups: Sometimes a group is structured so that a subset of the elements, with the same group 
operator, forms a smaller group. For example, the numbers 0 - 11 form a group under addition mod 12, with 
identity 0. Then the set S = {0, 4, 8} forms a subgroup of the parent group. A subgroup inherits the identity 
and associativity from the parent. Notice that the parent group has size (aka “order”) 12, and the subgroup
has order 3. Three divides 12, and this is an example of Lagrange’s Theorem. 
[Mathematicians call the size of a group the order, but the size of a set is its cardinality, thus defining two
new and different words for the same idea as the existing word “size.” In a surprising show of consistency, though,
the size (order) of a group G is written |G|, and the size (cardinality) of a set C is written |C|.


Another nit: strictly speaking, a group is a subgroup of itself, so it is the only subgroup that isn’t “smaller” than 
the group.] 


Cosets: Given a group G, a subgroup S, and a group element g, the “left coset of S with respect to g” is: 


$$gS \equiv \{g,\ g s_1,\ g s_2,\ \ldots\}, \quad \text{where } s_i = \text{the non-identity elements of } S.$$


In our subgroup example above, the coset of S with respect to 0 is just {0, 4, 8}. The coset with respect to 1
is {1, 5, 9}; w.r.t. 2 is {2, 6, 10}; and w.r.t. 3 is {3, 7, 11}. You might think there are 8 more cosets, but the
other elements of G (4 through 11) each reproduce one of these 4 cosets. For example, the coset w.r.t. 5 =
{5, 9, 1}, which is the same as the coset w.r.t. 1. Thus a subgroup S of a group G defines a set of cosets.


We now prove three properties of the set of cosets of a subgroup. (a) The size of a coset is the size of
the subgroup. By definition, a coset consists of one element for each of the elements of S. Furthermore,
the coset elements are distinct, because the group operation is invertible: if $g s_1 = g s_2$, then premultiplying by
$g^{-1}$ proves that $s_1 = s_2$. Therefore, if $s_1 \ne s_2$, then $g s_1 \ne g s_2$. Hence, the size of the coset is the size of S.


(b) No cosets overlap, i.e. they are all disjoint. We show that by showing that given any element of a
coset, we can generate all the other elements of the coset. Consider an element $a = g s_1$ in the coset w.r.t. g.
Then for any subgroup element s, a·s is also in the coset with a:


$$a s = (g s_1) s = g (s_1 s), \quad \text{and} \quad s_1 s \in S.$$


Now, let s take on all values in S, then a·s produces a distinct list of coset members (again from invertibility).
In fact, a·s produces all of the coset members, one for each value of s. Thus, if two cosets contain the same
member a, they are identical. So different cosets must be disjoint.


(c) Every element h of the group G appears in some coset. To be in a coset, we must have h = gs, for
some g in G, and some s in S. To prove that a given h appears in some coset, we pick an arbitrary s in S, and
solve for g: $g = h s^{-1}$. Thus every element of G appears in some coset, and since cosets don’t overlap, every
element of G appears in exactly one coset.


[In fact, we can choose s to be the identity, then g = h: every element h appears in the coset with respect 
to itself. ] 


Lagrange’s Theorem: Putting the 3 properties together, we find that all the elements of G are distributed 
into equal-sized cosets, so the size of the cosets divides the size of G. And the size of S is the size of the 
cosets, ergo the size of S divides the size of G. QED. 
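
The three coset properties can be verified by brute force for the mod-12 example above. This is a small Python sketch of our own: left_coset is a hypothetical helper name, and the group operation is taken to be addition mod n, as in the example.

```python
def left_coset(g, subgroup, n):
    """Left coset of `subgroup` with respect to g, for the group of integers under addition mod n."""
    return frozenset((g + s) % n for s in subgroup)


n = 12
G = set(range(n))
S = {0, 4, 8}

# One coset per group element; duplicates collapse because a set keeps distinct values only.
cosets = {left_coset(g, S, n) for g in G}

for c in sorted(cosets, key=min):
    print(sorted(c))

same_size = all(len(c) == len(S) for c in cosets)     # property (a)
disjoint = sum(len(c) for c in cosets) == len(G)      # property (b), given (c)
covering = set().union(*cosets) == G                  # property (c)
print(same_size, disjoint, covering, len(G) // len(S))
```

The 12 elements fall into 4 distinct cosets of 3 elements each, so |S| divides |G|, as Lagrange’s Theorem requires.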


Public Key Cryptography on a Hand Calculator 


Public key cryptography is increasingly important in our digital world. The theory is clever, but 
understandable with only high school math. We present here the theory, along with simple numerical 
examples. We describe the RSA algorithm in detail, and give insight into the number theory behind it. 
(However, we do not prove the well-known number theorems.) Finally, get your name in lights by being the 
first to decrypt the message at the end, and send me an encrypted reply. 


This section assumes some familiarity with modular arithmetic. 


Public Key Cryptography Overview 


Figure 18.1 shows old-fashioned cryptography, where the receiver must share a secret key with the 
sender. This requires a secure channel be temporarily established ahead of time to share the secret, which 
can be inconvenient. 
Figure 18.1 Old-fashioned, secret key cryptography.


Figure 18.2 shows public key cryptography, where the encryption and decryption keys are different. The 
receiver generates a matched-pair of keys: one private and one public. She publishes the public key widely. 
It allows anyone to encrypt a message to her, but gives no useful information on how to decrypt the message. 
The receiver keeps the secret key private; it allows decryption. The sender uses the public key to encrypt a 
message, and sends over public (insecure) channels to the receiver. The receiver decrypts the message with 
the private key. 


Figure 18.2 Public key cryptography.


The most widely used public key cryptography method is the RSA algorithm (Rivest, Shamir, Adleman).
It was patented in 1983, but was released to the public domain in 2000, just before the patent expired [Wik01].


RSA Overview 


The RSA public key comprises two integers (e, n): an exponent e, and a modulus n. For a message m <
n, we encrypt it into cyphertext c as:


$$c = m^e \bmod n.$$

The private key is another integer exponent d such that:

$$c^d = (m^e)^d = m^{ed} = m \pmod{n}. \qquad (18.1)$$

Many equations are written with “(mod n)” at the end (as in (18.1)), which denotes that both sides of all
equals signs are taken mod n.


RSA is secure because there exist keys (e, n) for which it is computationally infeasible to find m given c,
e, and n. RSA is practical because it is computationally feasible to find such sets of (e, n, d), and to perform
the encryption and decryption operations. In practice, n has around 600 digits (2048 bits), e ~ 5 digits, and
d ~ 600 digits.


Examples of Relevant Modular Arithmetic 


Modular arithmetic is used in lots of cryptographic algorithms. It is easy to compute, and simple to 
understand. All numbers are integers, and for RSA, positive. Modular arithmetic’s key properties (get it?) 
are: 
• $x \bmod n \equiv$ remainder of x/n, e.g. 16 mod 15 = 1

• $(ab \bmod n) = (a \bmod n)(b \bmod n) \bmod n$.

• $(a^b)^c \bmod n = (a^b \bmod n)^c \bmod n$.


In other words, when doing a big calculation that will be taken “mod n” at the end, you can save calculation 
time by taking intermediate results “mod n” along the way. 


The RSA algorithm is based on the difficulty of (a) finding modular roots of large numbers, and (b) factoring
large numbers.


Figure 18.3 (a) Arithmetic mod 15. The 8 numbers with no common factor with 15 are highlighted.
(b) Striking out numbers with common factors with 15 (the multiples of 3 and the multiples of 5).


Let’s look at some simple examples of modular powers: say, some powers of integers mod 15 (n = 15):

$$2^0 \bmod 15 = 1, \quad 2^1 \bmod 15 = 2, \quad 2^2 \bmod 15 = 4, \quad 2^3 \bmod 15 = 8, \quad 2^4 \bmod 15 = 1.$$


At this point, the powers of two simply repeat, because $2^4 \bmod 15 = 2^0 = 1$. The period of repetition is T =
4.


Here’s another (from now on, the “mod 15” is understood):

$$7^0 = 1, \quad 7^1 = 7, \quad 7^2 = 4, \quad 7^3 = 13, \quad 7^4 = 1. \qquad T = 4.$$

And another:

$$11^0 = 1, \quad 11^1 = 11, \quad 11^2 = 1. \qquad T = 2.$$


From group theory, it can be shown that for integers m that have no common factor with 15 (i.e., 1, 2, 4, 7,
8, 11, 13, 14), the period is at most 8, but might be 4 or 2 (more later). For n = 15, it turns out that the longest
period is 4. This means for all such m, $m^5 = m$. But so does $m^9 = m$, and $m^{4k+1} = m$ for any integer k ≥ 0.


What about the m with a common factor with n = 15? Let’s see:

$$3^0 = 1, \quad 3^1 = 3, \quad 3^2 = 9, \quad 3^3 = 12, \quad 3^4 = 6, \quad 3^5 = 3. \qquad T = 4.$$


The sequence repeats, but never returns to 1. Any power of 3 is a multiple of 3, and any multiple of 15 is a
multiple of 3, so $3^k$ (k > 0) can never have a remainder of 1, i.e., $3^k \bmod 15$ can never be 1. In this case, the period
is 4, and $3^5 = 3$, much like the m with no common factor. The same argument applies to any m with a common
factor with n. It might seem like a coincidence that $m^5 = m$ even for m with a common factor with n, but it
can be proved that this is always true [Wik01]. This is of theoretical interest, but in practice, it never happens.
With n ~ 600 digits (p and q each about 300 digits), about $10^{300}$ of the roughly $10^{600}$ possible messages m
share a factor with n. Your chance of hitting one with your message is about 1 in $10^{300}$, which may explain
why this special case of RSA is rarely discussed in the literature. (It is mentioned in the original paper [RSA77 p7].)


The big result is this: for any m < n = 15, $m^{4k+1} = m$, k any integer ≥ 0.
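
These power sequences are easy to generate by machine. The Python sketch below is our own illustration (power_period is a hypothetical helper name): it lists the powers of m mod n until a value repeats, and reports the period of repetition.

```python
from math import gcd


def power_period(m, n):
    """Return (sequence, period): the powers m^0, m^1, ... mod n up to the first repeat."""
    seq = []
    x = 1 % n
    while x not in seq:
        seq.append(x)
        x = (x * m) % n
    # x is the first repeated value; the period is the distance back to its first occurrence.
    return seq, len(seq) - seq.index(x)


n = 15
for m in (2, 7, 11, 3):
    print(m, *power_period(m, n))

# Longest period among the m with no common factor with n.
max_period = max(power_period(m, n)[1] for m in range(1, n) if gcd(m, n) == 1)
print("max period:", max_period)

# The "big result": m^(4k+1) = m (mod 15) for every m < 15, common factor or not.
big_result = all(pow(m, 4 * k + 1, n) == m for m in range(n) for k in range(5))
print("m^(4k+1) = m for all m:", big_result)
```

The built-in three-argument pow(m, x, n) computes $m^x \bmod n$ directly, and is used here only to cross-check the result.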
Application to RSA Cryptosystem


RSA applies the results of the previous section to a public key cryptosystem. For a message m < n, we 
encrypt it by choosing a fixed exponent e, and defining the encrypted cyphertext to be: 


$$c = m^e \bmod n.$$


For example, we might choose e = 3, and then $c = m^3 \bmod n$. Continuing our example of n = 15, if our
message is 7, our encrypted codeword is $c = 7^3 = 13$. To decrypt it, we find the exponent d such that:


$$c^d = (m^e)^d = m^{ed} = m \quad\Rightarrow\quad \text{for } n = 15,\ ed = k \cdot 4 + 1.$$

For e = 3, d = 3 gives ed = 9, so our decryption exponent is d = 3 (that d = e is an artifact of our unrealistically
small integers, and never happens in practice).


If our message m > n, we break it into smaller chunks, such that each chunk < n, and encrypt each chunk 
separately. 


As another example, take n = 35. One can show that the largest T = 12, and all periods divide 12. Thus
$m^{k \cdot 12 + 1} = m$, for all m < 35. We might again choose e = 3, and then must find d such that ed = k·12 + 1. But
this is impossible, because all multiples of 12 are also multiples of 3, and so cannot be one less than a multiple
of 3. We must choose e < 12 with no common factor with 12, say e = 5. Now find d such that ed = k·12 +
1. Trying multiples of 12 in order, we find that 24 + 1 = 5·5, so d = 5. (This is another annoying coincidence
that does not happen in practice, but also happens with n = 35 for the other possible values of e: 7 and 11.)


Another example: for n = 77, the maximum period is T = 30. Then we might choose an encryption
exponent e = 7. The decryption exponent is 13: for all m < 77, $(m^7)^{13} = m^{91} = m \pmod{77}$.


In general, a decode exponent d satisfies $ed = k\varphi(n) + 1$, or equivalently:

$$ed \bmod \varphi(n) = 1, \quad \text{written as} \quad d \equiv e^{-1} \pmod{\varphi(n)}, \quad \text{where } e^{-1} \equiv \text{integer}.$$


In principle, this is all you need to implement the RSA algorithm. The details are now in feasibly finding
a set of numbers (e, n, d) for which the exponentiation is feasible, and modular roots are infeasible. Finding
the set (e, n, d) is called key generation, and for n ~ 2048 bits, can be done in a few seconds on an ordinary
computer (c. 2020).


Raising to Large Powers 


Raising a number to a large power, such as 65537 or a 600-digit exponent, seems like a daunting task. But raising to a
power mod n is feasible. Arbitrary size arithmetic packages are widely available for computers (e.g., Python 
includes arbitrary sized integers in the language). Thus 600-digit multiplies and remainder (modulus) 
operations are readily attainable. 


One tool for evaluating large exponents is that after each multiply, we can take the result mod n, which 
keeps the size of the results within the size of n. Furthermore, squaring a number doubles its exponent: 


$$\left(m^x\right)^2 = m^{2x}.$$

We can achieve any exponent by intermixing multiplies of m with the squarings. E.g.:

$$\left(m^2 \cdot m\right)^2 = m^6, \qquad \left(m^2 \cdot m\right)^2 \cdot m = m^7, \qquad m^{65537} = \underbrace{(m)}_{\text{squared 16 times}} \cdot\, m.$$


Thus it takes only 17 multiplies to raise a number to the 65537th power. For decoding, d ~ n or ~2,000 binary
bits. Therefore, squaring a number 2,000 times raises it to the $2^{2000}$th power. Totally feasible.


Binary bit-twiddlers will recognize that a general algorithm to raise m to the xth power starts by writing
x in binary, e.g. x = 1011 1101. Then:


r = 1       // running accumulated product
For each bit of x from left to right:
    r = r*r
    if current bit == 1: r = r*m


The computation time is only O(log₂ x) multiplies.
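
The left-to-right square-and-multiply loop can be sketched in Python as follows. This is an illustrative version under our own naming (mod_pow is hypothetical); it takes every intermediate result mod n, as described earlier, and in real code Python’s built-in three-argument pow does the same job.

```python
def mod_pow(m, x, n):
    """Compute m**x mod n by scanning the bits of x from left to right."""
    r = 1  # running accumulated product
    for bit in bin(x)[2:]:      # binary digits of x, most significant first
        r = (r * r) % n         # squaring doubles the exponent accumulated so far
        if bit == "1":
            r = (r * m) % n     # a 1 bit adds one more factor of m
    return r


print(mod_pow(7, 3, 15))                         # the n = 15 encryption example
print(mod_pow(7, 91, 77))                        # (m^7)^13 = m^91 = m (mod 77)
print(mod_pow(123456789, 65537, 10**9 + 7) ==
      pow(123456789, 65537, 10**9 + 7))          # cross-check against the built-in
```

Because each pass through the loop does at most two multiplies, the number of multiplies is at most twice the number of bits in x.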


Finding the Maximum Period 


For decoding, we need to know T, the maximum period of a power sequence mod n. In our trivial
examples, we found T by trial and error. With n of ~600 digits, that is not feasible. Number theory comes
to the rescue. The Euler totient function φ(n) gives either T, or a multiple of T, for any power sequence mod
n. Either way, φ(n) is always a period. φ(n) is the number of numbers 1 ≤ r ≤ n − 1 that have no common
factor with n. Figure 18.3b shows this for n = 15, where φ(15) = 8. For n which are the product of two
primes, the totient function is easy to calculate (Figure 18.3b):


1. We start with all the natural numbers < 15 = 3·5. There are 14 of them.

2. Then cross out all the multiples of 3; there are 15/3 − 1 = 4 of them.

3. Then cross out all the multiples of 5; there are 15/5 − 1 = 2 of them.

4. There are 14 − 4 − 2 = 8 numbers left, which have no common factor with 15.


Generalizing, for n = pq, with p, q both prime: 


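$$\varphi(n) = (n - 1) - \Big(\underbrace{n/p}_{q} - 1\Big) - \Big(\underbrace{n/q}_{p} - 1\Big) = n - p - q + 1 \qquad \text{[RSA77 p6]}.$$


Then the decode exponent d satisfies $ed = k\varphi(n) + 1$, or equivalently:

$$ed \bmod \varphi(n) = 1, \quad \text{written as} \quad d \equiv e^{-1} \pmod{\varphi(n)}, \quad \text{where } e^{-1} \equiv \text{integer}.$$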


For any two integers a, b with no common factor, the extended Euclid algorithm efficiently finds $a^{-1} \bmod b$,
and so can find the required $d = e^{-1}$. We do not describe the extended Euclid algorithm here.


In practice, φ(n) might be larger than needed, since it may be a multiple of T. Then d above is slightly
inefficient. Instead, the slightly more complicated Carmichael function λ(n) gives the maximum period T
exactly [Wik02]. The optimum $d = e^{-1} \pmod{\lambda(n)}$.
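
For readers who want to experiment, here is one common iterative form of the extended Euclid algorithm as a Python sketch. It is offered as an illustration, not a full explanation; the names extended_gcd and mod_inverse are our own. For n = pq with p, q prime, we assume the standard result λ(n) = lcm(p − 1, q − 1).

```python
from math import gcd


def extended_gcd(a, b):
    """Return (g, x, y) such that a*x + b*y == g == gcd(a, b)."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y


def mod_inverse(e, modulus):
    """Return d with (e * d) % modulus == 1, or raise if e and modulus share a factor."""
    g, x, _ = extended_gcd(e, modulus)
    if g != 1:
        raise ValueError("no inverse: e and the modulus share a common factor")
    return x % modulus


p, q, e = 7, 11, 7                              # the n = 77 example
phi = (p - 1) * (q - 1)                         # phi(77) = 60
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)    # lambda(77) = lcm(6, 10) = 30
print(mod_inverse(e, phi), mod_inverse(e, lam))
```

For n = 77 and e = 7, the inverse mod λ(n) = 30 is the d = 13 found earlier; the inverse mod φ(n) = 60 is a larger exponent that also decrypts correctly.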


Finding Large Primes: It’s a Gamble 


The previous section shows that, to be able to compute d, we must choose an n which is the product of 
two large primes, p and g. How do we find such primes? Guess and check: guess a number at random, and 
test it for being prime. The density of prime numbers decreases approximately like 1/log p, and for
1000-digit numbers, about 1 in 2,000 is prime.


For each guess, number theory is essential to test for “primality”. There are efficient algorithms for 
statistical primality tests. To get a feel for how such tests work, recall Fermat’s little theorem: if p is prime,
then for any integer b, $b^p = b \pmod{p}$. (This is related to the totient function theorem.) For a random guess
r, we can choose any base b, and compute $b^r \bmod r$. If the result ≠ b, then r is not prime, and we guess again.
If r passes the test, it may or may not be prime, i.e. our test may give a “false positive”. Number theory can
put upper bounds on the fraction of tests f < 1 yielding false positive. If we perform g independent tests, then
the probability of a false positive on all of them is < $f^g$. Thus we can make our probability of error as small
as we wish, by choosing g as large as we wish. It is easily feasible to make our chance of error < 107!. 


Thus we cannot prove that our “primes” p and q are indeed prime, but we can make the chance of error 
so small as to be negligible. 
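
A bare-bones version of the Fermat test can be sketched in Python as below. This is only meant to give the flavor: the function names are hypothetical, and this simple form is not what production libraries use, because certain composites (the Carmichael numbers, such as 561) pass it for every base. Practical code uses a strengthened test such as Miller-Rabin.

```python
import random


def passes_fermat(b, r):
    """True if r passes the Fermat test for base b, i.e. b^r = b (mod r)."""
    return pow(b, r, r) == b % r


def probably_prime(r, trials=20, seed=None):
    """Repeat the Fermat test with random bases; False means r is certainly composite."""
    if r < 4:
        return r in (2, 3)
    rng = random.Random(seed)
    for _ in range(trials):
        b = rng.randrange(2, r - 1)
        if not passes_fermat(b, r):
            return False
    return True


print(passes_fermat(2, 91))    # base 2 exposes 91 = 7 * 13 as composite
print(passes_fermat(3, 91))    # base 3 gives a false positive for 91
print(probably_prime(65537))   # a true prime passes for every base
```

The base-3 result shows why a single test is not enough: 91 is composite, yet $3^{91} = 3 \pmod{91}$.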


RSA By the Numbers 
Putting all the above together, an RSA matched key pair can be generated as follows: 


1. The receiver finds two large primes, p and q, and defines n = pq. 
2. e is typically chosen as 65 537, which is prime, and so has no common factor with λ(n) unless it divides λ(n).
3. $d = e^{-1} \pmod{\lambda(n)}$.


The receiver publishes (e, n) in a public directory for all to see. She keeps d, p, and q secret. In fact, she can
destroy p and q forever, since only d is needed going forward.
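
The three key-generation steps can be tried end to end with toy numbers. The Python sketch below uses assumed example values (p = 61, q = 53, e = 17, chosen only because they are small) and relies on the built-in pow(e, -1, lam) for the modular inverse, which requires Python 3.8 or later. Real keys use primes of hundreds of digits and a vetted cryptographic library.

```python
from math import gcd

# Step 1: two (toy) primes and the modulus.
p, q = 61, 53
n = p * q

# Carmichael function for n = p*q: lambda(n) = lcm(p - 1, q - 1).
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)

# Step 2: public exponent, which must share no factor with lambda(n).
e = 17
assert gcd(e, lam) == 1

# Step 3: private exponent d = e^-1 (mod lambda(n)).
d = pow(e, -1, lam)

# Encrypt with the public key (e, n), decrypt with the private exponent d.
m = 65
c = pow(m, e, n)
recovered = pow(c, d, n)
print(n, lam, d, c, recovered)
```

The round trip recovers the original message m for every m < n, which can be checked exhaustively at this toy size.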


Practical Note on Efficiency 


RSA is slow to compute: large exponents are costly. In practice, to save compute time, a sender with a 
large message usually encrypts it with a standard, fast algorithm (e.g., AES-256) using a randomly generated 
one-time key. The sender then RSA encrypts this key, and sends both the message cyphertext and the 
publicly-encrypted one-time key to the receiver. 


Digital Signatures 


Since $(m^e)^d = (m^d)^e$, RSA can be used for a digital signature: a number that proves the message came
from the claimed sender. Suppose Bob sends Alice a message. Since it is rather personal, he wants Alice to
know the message came from him, and no one is impersonating him. So he actually encrypts his message
first with his secret decode exponent. He then encrypts it with Alice’s public encode key. Alice decrypts it first
with her private decode key, and then decodes it again with Bob’s public encode exponent. Only Bob could
know the secret d needed to make this last decoding work. Therefore, Bob’s message is both private, and
cannot be forged.
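
The signing half of this exchange can be sketched with toy numbers in Python. The key values below are assumed for illustration only (n = 3233 = 61·53, e = 17, d = 413), and the sketch omits the second layer of encryption with Alice’s key.

```python
# Bob's toy key pair (assumed example values): public (e, n), private d.
n, e, d = 3233, 17, 413

message = 1234                       # must be less than n

# Bob "encrypts" with his secret exponent d to produce the signature.
signature = pow(message, d, n)

# Anyone can undo that with Bob's public exponent e and compare with the message.
recovered = pow(signature, e, n)
print(recovered == message)

# A tampered signature no longer decodes to the message.
forged = (signature + 1) % n
print(pow(forged, e, n) == message)
```

Verification needs only the public pair (e, n), while producing a valid signature requires d.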


However, since Bob is rather wordy when expressing his feelings, his missive is too long to be efficiently 
encoded with RSA. For a more efficient signature, he computes a short cryptographic hash of his message 
(e.g., MD5 is a standard 128-bit hash). He then encrypts this hash with his secret decode key, and then with
Alice’s public key. Alice can verify that the signed hash matches the secret message, and therefore it must 
have come from Bob. Furthermore, Alice is free to publicly reveal Bob’s RSA signature of the hash, so she 
can prove in court that Bob sent the message. Bob can never again deny his feelings. 


Some experts recommend using a different key-pair for digital signing than for your received message 
public-key encryption, to minimize the exposure of encrypted messages to would-be attackers. 


Vulnerabilities 


Cryptographic security is very intricate and difficult. We can only touch on a few issues here, and we 
are not cryptographic experts. To protect your data, carefully follow the advice of cryptographic experts. 


The biggest vulnerability is always hubris: people without extensive education and experience who 
invent their own, or modify, cryptosystems. (See IEEE’s WiFi WEP debacle for a particularly egregious 
example of this.) 


A mathematical breakthrough in quickly finding either modular roots of large numbers, or factors of 
large numbers, would break RSA. RSA has been known since 1977 [RSA77], and many people have tried 
to break it, but it remains secure as of 2020. Note that another public key method of the same era, the 
knapsack algorithm, has been broken [ref??]. 


Replay attack: An intruder can see an encrypted message. They may be able to inject a copy of this 
message at a later time, and thus fool the receiver. For example, if a genuine message says “Please send me 
$100. -Bob”, an intruder could replay that at a later time, tricking Alice into sending another $100. Replays 
are usually prevented by having the sender put a sequence number (such as a time-stamp) in each message. 
The genuine sender always increments the sequence number, so the receiver will recognize an old replayed 
message as invalid. However, requiring a sequence number introduces “state” in the communication channel: 
the receiver must keep track of a separate sequence number for each sender. 


RSA has many vulnerabilities for poorly chosen values of p and q, which are beyond our scope [Wik01]. 
Secret Message


To get your name in lights, be the first to decrypt this secret message, and send me an encrypted reply 
(show your work for several decryptions and encryptions). Encrypt each character separately. The 
“alphabet” is space = 2, A - Z = 3 - 28, (e, n) =(7, 77): 


39 10 69 21 11 47 21 51 48 23 42 28 21 
Glossary


Definitions of common mathematical physics terms. “Special” mathematical definitions are noted by 
“(math)”. These are technical mathematical terms that you shouldn’t have to know, but will make reading 
math books a lot easier because they are very common. These definitions try to be conceptual and accurate, 
but also comprehensible to “normal” people (including physicists, but not mathematicians). 


1-1 A mapping from a set A to a set B is 1-1 if every value of B under the map has only one 
value of A that maps to it. In other words, given the value of B under the map, we can 
uniquely find the value of A which maps to it. Equivalently, Ma = Mb implies a = b. 
However, see “1-1 correspondence.” See also “injection.” A 1-1 mapping is invertible on its range, i.e. on the values of B that it actually reaches. 


1-1 correspondence A mapping, between two sets A and B, is a 1-1 correspondence if it uniquely 
associates each value of A with a value of B, and each value of B with a value of A. 
Synonym: bijection. 


accumulation point syn. for limit point. 
adjoint¹ The adjoint of an operator produces a bra from a bra in the same way the original operator 
produces a ket from a ket: $O|\psi\rangle = |\phi\rangle \;\Rightarrow\; \langle\psi|O^\dagger = \langle\phi|, \ \forall\,\psi$. The adjoint of an 
operator is the operator which preserves the inner product of two vectors as 
$\langle v|\cdot(O|w\rangle) = (O^\dagger|v\rangle)^\dagger\cdot|w\rangle$. The adjoint of an operator matrix is the conjugate-transpose. This has 
nothing to do with matrix adjoints (below). 


adjoint² In matrices, the transpose of the cofactor matrix is called the adjoint of a matrix. This has 
nothing to do with linear operator adjoints (above). 


adjugate the transpose of the cofactor matrix: $\operatorname{adj}(A)_{ij} = C_{ji} = (-1)^{i+j} M_{ji}$, where $M_{ji}$ is the transpose 
of the minor matrix: $M_{ij}$ = det(A deleting row i and column j). 
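The minor, cofactor, and adjugate definitions can be checked numerically with a short sketch. It uses plain Python lists and a Laplace expansion along the first row, which is only sensible for small matrices; the function names are hypothetical. Indices here are 0-based, which leaves the sign (-1)^(i+j) unchanged.

```python
def minor_matrix(a, i, j):
    """Return the matrix a with row i and column j deleted."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]


def det(a):
    """Determinant by Laplace expansion along the first row."""
    if not a:
        return 1  # determinant of the empty (0x0) matrix
    return sum((-1) ** j * a[0][j] * det(minor_matrix(a, 0, j))
               for j in range(len(a)))


def cofactor(a, i, j):
    """C_ij = (-1)^(i+j) M_ij, where M_ij is the ij-th minor."""
    return (-1) ** (i + j) * det(minor_matrix(a, i, j))


def adjugate(a):
    """adj(A)_ij = C_ji: the transpose of the cofactor matrix."""
    n = len(a)
    return [[cofactor(a, j, i) for j in range(n)] for i in range(n)]


A = [[2, 0, 1], [1, 3, 2], [1, 1, 2]]
adjA = adjugate(A)
# Standard identity: A times adj(A) equals det(A) times the identity matrix.
product = [[sum(A[i][k] * adjA[k][j] for k in range(3)) for j in range(3)]
           for i in range(3)]
```

The identity A adj(A) = det(A) I is why the adjugate gives the matrix inverse after dividing by the determinant.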


analytic A function is analytic in some domain IFF it has continuous derivatives to all orders, i.e. is 
infinitely differentiable. For complex functions of complex variables, if a function has a 
continuous first derivative in some region, then it has continuous derivatives to all orders, 
and is therefore analytic. 


analytic geometry the use of coordinate systems along with algebra and calculus to study geometry. 
Aka “coordinate geometry.” 


bijection Both an “injection” and a “surjection,” i.e. 1-1 and “onto.” A mapping between sets A and 
B is a bijection iff it uniquely associates a value of A with every value of B. Synonym: 
1-1 correspondence. 


BLUE In statistics, Best Linear Unbiased Estimator. 


branch point A branch point is a point in the domain of a complex function f(z), z also complex, with 
this property: when z traverses a closed path around the branch point, following continuous 
values of f(z), f(z) has a different value at the end of the path than at the beginning, even 
though the beginning and end point are the same point in the domain. Example TBS: 
square root around the origin. 
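The square-root example can be sketched numerically. The code below walks z once around the unit circle and, at each step, keeps whichever of the two square roots is closer to the previous value, which is one simple way to "follow continuous values." The function name and step count are illustrative assumptions.

```python
import cmath
import math


def follow_sqrt(n_steps=1000, radius=1.0):
    """Follow a continuous square root of z around a circle about the origin."""
    w = cmath.sqrt(radius)  # start at z = radius on the positive real axis
    start = w
    for k in range(1, n_steps + 1):
        z = radius * cmath.exp(2j * math.pi * k / n_steps)
        r = cmath.sqrt(z)   # principal root; the other root is -r
        # Keep the root nearest the previous value, so w varies continuously.
        w = r if abs(r - w) <= abs(-r - w) else -r
    return start, w


start, end = follow_sqrt()
# z returns to its starting point, but the followed root has changed sign.
```

Going around twice brings the followed root back to its starting value, which is the usual two-sheet picture of the square root.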


boundary point (math) see “limit point.” 


by definition in the very nature of the definition itself, without requiring any logical steps. To be 
distinguished from “by implication.” 


by implication By combining the definition with other true statements, a conclusion can be shown by 
implication. 
C or ℂ the set of complex numbers. 


closed (math) contains any limit points. For finite regions, a closed region includes its boundary. 
Note that in math talk, a set can be both open and closed! The surface of a sphere is open 
(every point has a neighborhood in the surface), and closed (no excluded limit points; in 
fact, no limit points). 


cofactor The ij-th minor of an n×n matrix is the determinant of the (n−1)×(n−1) matrix formed by 
crossing out the i-th row and j-th column. A cofactor is just a minor with a plus or minus 
sign affixed, according to whether (i, j) is an even or odd number of steps away from (1,1): 

$C_{ij} = (-1)^{i+j} M_{ij}$ 


compact (math) for our purposes, closed and bounded [Tay thm 2-6I p66]. A compact region may 
comprise multiple (infinite number??) disjoint closed and bounded regions. 


congruence a set of 1D non-intersecting curves that cover every point of a manifold. Equivalently, a 
foliation of a manifold with 1D curves. Compare to “foliation.” 


contrapositive The contrapositive of the statement “If A then B” is “If not B then not A.” The 
contrapositive is equivalent to the statement: if the statement is true (or false), the 
contrapositive is true (or false). If the contrapositive is true (or false), the statement is true 
(or false). 


convergent approaches a definite limit. For functions, this usually means “convergent in the mean.” 


convergent in the mean a function f(x) is convergent in the mean to a limit function L(x) IFF the 
mean-squared error approaches zero. Cf. “uniform convergence.” 


converse The converse of the statement “If A then B” is “If B then A”. In general, if a statement is 
true, its converse may be either true or false. The converse is the contrapositive of the 
inverse, and hence the converse and inverse are equivalent statements. 
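The relationships among a statement, its converse, its inverse, and its contrapositive can be verified by brute force over all truth values. This is a small sketch; the helper `implies` encodes “If p then q” as (not p) or q, the standard truth-functional reading.

```python
from itertools import product


def implies(p, q):
    """Truth value of "If p then q"."""
    return (not p) or q


rows = list(product([False, True], repeat=2))  # all (A, B) combinations

statement = [implies(a, b) for a, b in rows]               # If A then B
converse = [implies(b, a) for a, b in rows]                # If B then A
inverse = [implies(not a, not b) for a, b in rows]         # If not A then not B
contrapositive = [implies(not b, not a) for a, b in rows]  # If not B then not A

same_as_contrapositive = statement == contrapositive
converse_matches_inverse = converse == inverse
statement_matches_converse = statement == converse
```

The first two comparisons hold for every row of the truth table, while the third fails, which matches the glossary: a true statement says nothing about its converse.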


connected There exists a continuous path between any two points in the set (region). See also: simply 
connected. [One p178]. 


coordinate geometry the use of coordinate systems along with algebra and calculus to study geometry. 
Aka “analytic geometry”. 


decreasing non-increasing: a function is decreasing IFF for all b > a, f(b) ≤ f(a). Note that 
“monotonically decreasing” is the same as “decreasing”. This may be restricted to an 
interval, e.g. f(x) is decreasing on [0, 1]. Compare “strictly decreasing”. 


device a procedure that’s only used twice. See “trick” and “method.” 


diffeomorphism a $C^\infty$ (1-1) map, with a $C^\infty$ inverse, from one manifold onto another. “Onto” implies the 
mapping covers the whole range manifold. Two diffeomorphic manifolds are topologically 
identical, but may have different geometries. 


divergent not convergent: a sequence is divergent IFF it is not convergent. 
domain of a function: the set of numbers (usually real or complex) on which the function is defined. 


elliptic operator A second-order differential operator in multiple dimensions, whose second-order 
coefficient matrix has eigenvalues all of the same algebraic sign (none zero). 


entire A complex function is entire IFF it is analytic over the entire complex plane. An entire 
function is also called an “integral function.” 


essential singularity a “pole” of infinite order, i.e. a singularity around which the function is 
unbounded, and cannot be made finite by multiplication by any power of $(z - z_0)$ [Det 
p165]. 


expected value a misnomer for “average.” 
fact A small piece of information backed by solid evidence (in hard science, usually repeatable 
evidence). If someone disputes a fact, it is still a fact. “If a thousand people say a foolish 
thing, it is still a foolish thing.” See also “speculation,” and “theory.” 


factor a number (or more general object) that is multiplied with others. E.g., in “(a + b)(x + y)”, 
there are two factors: “(a + b)”, and “(x + y)”. 

finite a non-zero number. In other words, not zero, and not infinity. 

foliation a set of non-intersecting submanifolds that cover every point of a manifold. E.g., 3D real 
space can be foliated into 2D “sheets stacked on top of each other,” or 1D curves packed 
around each other. Compare to “congruence.” 


holomorphic syn. for analytic. Other synonyms are regular, and differentiable. Also, a “holomorphic 
map” is just an analytic function. 


homomorphic something from abstract categories that should not be confused with homeomorphism. 


homeomorphism a continuous (1-1) map, with a continuous inverse, from one manifold onto another. 
“Onto” implies the mapping covers the whole range manifold. A homeomorphism that 
preserves distance is an isometry. 


hyperbolic operator A second-order differential operator in multiple dimensions, whose second-order 
coefficient matrix has non-zero eigenvalues, with one of opposite algebraic sign to all the 
others. 

identify to establish a 1-1 and onto relationship. If we identify two mathematical things, they are 
essentially the same thing. 

IFF (or iff) if, and only if. 

injection A mapping from a set A to a set B is an injection if it is 1-1, that is, if a value of B in the 
mapping uniquely determines the value of A which maps to it. Note that every value of A 
is included by the definition of “mapping” [CRC 30], but the mapping does not have to 
cover all the elements of B. 


integral function Syn. for “entire function:” a function that is analytic over the entire complex plane. 


inverse The inverse of the statement “If A then B” is “If not A then not B.” In general, if a statement 
is true, its inverse may be either true or false. The inverse is the contrapositive of the 
converse, and hence the converse and inverse are equivalent statements. 


invertible A map (or function) from a set A to a set B is invertible iff for every value in B, there exists 
a unique value in A which maps to it. In other words, a map is invertible iff it is a bijection. 
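For finite sets, the glossary's 1-1, onto, and invertible properties can be tested directly. The sketch below represents a mapping from A to B as a Python dict whose keys are all of A; the function names are hypothetical.

```python
def is_injective(mapping):
    """1-1: no two values of A map to the same value of B."""
    values = list(mapping.values())
    return len(set(values)) == len(values)


def is_surjective(mapping, codomain):
    """Onto: every value of B is reached by some value of A."""
    return set(codomain) <= set(mapping.values())


def is_bijective(mapping, codomain):
    """Invertible: both 1-1 and onto."""
    return is_injective(mapping) and is_surjective(mapping, codomain)


B = {"x", "y", "z"}
one_to_one_only = {1: "x", 2: "y"}              # 1-1, but misses "z"
onto_only = {1: "x", 2: "y", 3: "z", 4: "z"}    # onto, but 3 and 4 collide
both = {1: "x", 2: "y", 3: "z"}                 # a bijection

inverse_map = {b: a for a, b in both.items()}   # well defined only for a bijection
```

Swapping keys and values gives the inverse map, and that construction loses information unless the mapping is a bijection.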


isolated singularity a singularity at a point, which has a surrounding neighborhood of analyticity [Det 
p165]. 
isometry a homeomorphism that preserves distance, i.e. a continuous, invertible (1-1) map from one 
manifold onto another that preserves distance (“onto” in the mathematical sense). 
isomorphic “same structure.” A widely used general term, with no single precise definition. 


limit point of a domain is a boundary of a region of the domain: for example, the open interval (0, 1) 
on the number line and the closed interval [0, 1] both have limit points of 0 and 1. In this 
case, the open interval excludes its limit points; the closed interval includes them 
(definition of “closed”). Some definitions define all points in the domain as also limit 
points. Formally, a point p is a limit point of domain D iff every open subset containing p 
also contains a point in D other than p. 


mantissa the number part of scientific notation: e.g., $3.14\times10^{7}$ has a mantissa of 3.14. In binary, 
$1.01\times2^{n}$ has a mantissa of 1.01. 
mapping syn. “function.” A mapping from a set A to a set B defines a value of B for every value of 
A [CRC 30]. 


meromorphic A function is meromorphic on a domain IFF it is analytic except at a set of isolated poles 
of finite order (i.e., non-essential poles). Note that branch points are nonanalytic points, 
but some are not poles (such as √z at zero), so a function including such a branch point is 
not meromorphic. 


method a procedure that’s used 3 or more times. See “trick” and “device.” 


minor The ij-th minor of an n×n matrix is the determinant of the (n−1)×(n−1) matrix formed by 
crossing out the i-th row and j-th column, i.e., the minor matrix: $M_{ij}$ = det(A deleting row i 
and column j). See also “cofactor.” 


monotonic all the same: a monotonic function is either increasing or decreasing (on a given interval). 
Note that “monotonically increasing” is the same as “increasing”. See also “increasing” 
and “decreasing”. 


N the set of natural numbers (positive integers). 

noise unpredictable variations in a quantity. 

oblique non-orthogonal and not parallel. 

one-to-one see “1-1”. 

onto covering every possible value. A mapping from a set A onto the set B covers every possible 
value of B, i.e. the mapping is a surjection. 


open (math) A region is open iff every point in the region has a finite neighborhood of points 
around it that are also all in the region. In other words, every point is an interior point. 
Note that open is not “not closed;” a region can be both open and closed. 


OS operating system, e.g. Unix, OS X, Windows. 


parabolic operator A second-order differential operator in multiple dimensions, whose second-order 
coefficient matrix has at least one zero eigenvalue. 


PDM phase dispersion minimization: an algorithm for finding arbitrarily shaped periodic signals 
in a (t, y), or (x, y), data set. 

pole a singularity near which a function is unbounded, but which becomes finite by 
multiplication by $(z - z_0)^k$ for some finite k [Det p165]. The value k is called the order of 
the pole. 


positive definite a matrix or operator which is > 0 for all non-zero operands. It may be 0 when acting on a 
“zero” operand, such as the zero vector. This implies that all eigenvalues > 0. 


positive semidefinite a matrix or operator which is ≥ 0 for all non-zero operands. It may be 0 when 
acting on a non-zero operand. This implies that all eigenvalues ≥ 0. 


predictor in regression: a variable put into a model to predict another value, e.g. $y_{mod}(x_1, x_2)$ is a model 
(function) of the predictors $x_1$ and $x_2$. 


PT perturbation theory. 

Q or ℚ the set of rational numbers. Q⁺ = the set of positive rationals. 

R or ℝ the set of real numbers. 

RMS root-mean-square. 

RV random variable. 

removable singularity an isolated singularity that can be made analytic by simply defining a value for 
the function at that point. For example, f(x) = sin(x)/x has a singularity at x = 0. You can 
remove it by defining f(0) = 1. Then f is everywhere analytic. [Det p165] 


4/27/2021 11:49AM — Copyright 2002-2021 Eric L. Michelsen. All rights reserved. 320 of 322 


elmichelsen.physics.ucsd.edu/ | Funky Mathematical Physics Concepts emichels at physics.ucsd.edu 


residue The residue of a complex function at a complex point $z_0$ is the $a_{-1}$ coefficient of the Laurent 
expansion about the point $z_0$. 


significand IEEE word for “mantissa.” 


simply connected There are no holes in the set (region), not even point holes. I.e., you can shrink any closed 
curve in the region down to a point, the curve staying always within the region (including 
at the point). 


singularity of a function: a point on a boundary (i.e. a limit point) of the domain of analyticity, but 
where the function is not analytic. [Det def 4.5.2 p156]. Note that the function may be 
defined at the singularity, but it is not analytic there. E.g., √z is continuous at 0, but not 
differentiable. 


smooth for most references, “smooth” means infinitely differentiable, i.e. $C^\infty$. For some, though, 
“smooth” means at least one continuous derivative, i.e. $C^1$, meaning first derivative 
continuous. This latter definition looks “smooth” to our eye (no kinks, or sharp points). 


speculation A guess, possibly hinted at by evidence, but not well supported. Every scientific fact and 
theory started as a speculation. See also “fact,” and “theory.” 


STL Standard Template Library, a set of libraries in the C++ language. 


strictly decreasing a function is strictly decreasing IFF for all b > a, f(b) < f(a). This may be restricted 
to an interval, e.g. f(x) is strictly decreasing on [0, 1]. Compare “decreasing”. 


strictly increasing a function is strictly increasing IFF for all a < b, f(a) < f(b). This may be restricted to an 
interval, e.g. f(x) is strictly increasing on [0, 1]. Compare “increasing”. 


surjection A mapping from a set A “onto” the set B, i.e. that covers every possible value of B. Note 
that every value of A is included by the definition of “mapping” [CRC 30], however 
multiple values of A may map to the same value of B. 


term a number (or more general object) that is added to others. E.g., in “ax + by + cz”, there are 
three terms: “ax”, “by”, and “cz”. 


theory the highest level of scientific achievement: a quantitative, predictive, testable model 
which unifies and relates a body of facts. A theory becomes accepted science only after 
being supported by overwhelming evidence. A theory is not a speculation, e.g. Maxwell’s 
electromagnetic theory. See also “fact,” and “speculation.” 


trace the trace of a square matrix is the sum of its diagonal elements. 
trick a procedure that’s only used once. See “device” and “method.” 
uniform convergence a sequence of functions $f_n(z)$ is uniformly convergent in an open (or partly open) 
region IFF its error ε after the N-th function can be made arbitrarily small with a single value 
of N (dependent only on ε) for every point in the region. I.e., given ε, a single N works for 
all points z in the region [Chu p156]. 


voila French contraction of “see there!” 
WLOG or WOLOG without loss of generality 
Z or ℤ the set of integers. Z⁺ or N = the set of positive integers (natural numbers). 
Formulas 
completing the square: $ax^2 + bx = a\left(x + \frac{b}{2a}\right)^2 - \frac{b^2}{4a}$ (x-shift = −b / 2a) 
Definite Integrals 
$\int_{-\infty}^{\infty} dx\, e^{-ax^2} = \sqrt{\frac{\pi}{a}}$, $\int_{-\infty}^{\infty} dx\, x^2 e^{-ax^2} = \frac{1}{2}\sqrt{\frac{\pi}{a^3}}$, $\int_{0}^{\infty} dr\, r^3 e^{-ar^2} = \frac{1}{2a^2}$ 
Statistical distributions 
$\chi^2_\nu$ : avg = ν, σ² = 2ν 

exponential : avg = τ 


error function [A&S]: $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt$ 
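gaussian included probability between −z and +z: 

$p_{gaussian}(z) = \int_{-z}^{+z} \mathrm{pdf}_{gaussian}(u)\,du = \frac{1}{\sqrt{2\pi}} \int_{-z}^{+z} e^{-u^2/2}\,du$. Let $u^2/2 = t^2$, $du = \sqrt{2}\,dt$: 

$= \frac{1}{\sqrt{2\pi}} \int_{-z/\sqrt{2}}^{+z/\sqrt{2}} e^{-t^2}\,\sqrt{2}\,dt = \mathrm{erf}\left(z/\sqrt{2}\right)$ 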
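The result that the gaussian probability between −z and +z equals erf(z/√2) can be checked numerically. The sketch below compares Python's `math.erf` against a direct midpoint-rule integration of the standard gaussian pdf; the function names and the number of integration steps are illustrative choices.

```python
import math


def p_gaussian(z):
    """Probability that a standard gaussian variable lies between -z and +z."""
    return math.erf(z / math.sqrt(2.0))


def p_numeric(z, n=20000):
    """Midpoint-rule integral of the standard gaussian pdf from -z to +z."""
    h = 2.0 * z / n
    total = 0.0
    for k in range(n):
        u = -z + (k + 0.5) * h          # midpoint of the k-th subinterval
        total += math.exp(-0.5 * u * u)
    return total * h / math.sqrt(2.0 * math.pi)


one_sigma = p_gaussian(1.0)   # the familiar "68%" within one standard deviation
two_sigma = p_gaussian(2.0)   # about 95% within two standard deviations
difference = abs(p_gaussian(2.0) - p_numeric(2.0))
```

The two methods agree to well within the accuracy of the midpoint rule for this step size.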


Special Functions 
$\Gamma(n) = (n-1)!$, $\Gamma(a) = \int_0^\infty dx\, x^{a-1} e^{-x}$, $\Gamma(a) = (a-1)\,\Gamma(a-1)$, $\Gamma(1/2) = \sqrt{\pi}$ 
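The gamma function identities listed here can be spot-checked with the standard library's `math.gamma`. This is only a numerical sanity check at a few sample arguments (the values 5 and 3.7 are arbitrary), not a proof.

```python
import math

n = 5
a = 3.7

# Gamma(n) = (n-1)! for positive integers n
factorial_ok = math.isclose(math.gamma(n), math.factorial(n - 1))

# Recursion: Gamma(a) = (a-1) * Gamma(a-1)
recursion_ok = math.isclose(math.gamma(a), (a - 1) * math.gamma(a - 1))

# Gamma(1/2) = sqrt(pi)
half_ok = math.isclose(math.gamma(0.5), math.sqrt(math.pi))
```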


The functions below use the Condon-Shortley phase: 


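$$Y_{lm}(\theta,\phi) = \begin{cases} (-1)^m \sqrt{\dfrac{(2l+1)}{2}\,\dfrac{(l-m)!}{(l+m)!}}\; P_{lm}(\cos\theta)\,\dfrac{e^{im\phi}}{\sqrt{2\pi}}, & m \ge 0, \\[2ex] \sqrt{\dfrac{(2l+1)}{2}\,\dfrac{(l-|m|)!}{(l+|m|)!}}\; P_{l|m|}(\cos\theta)\,\dfrac{e^{im\phi}}{\sqrt{2\pi}}, & m < 0, \end{cases}$$ 
$P_{lm}(x)$ is the associated Legendre function, 
l = 0, 1, 2, ..., m = −l, −l+1, ..., l−1, l. [Wyl 3.6.5 p96] 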


Index 


The index is not yet developed, so go to the web page on the front cover, and text-search in this document. 